The Problem & Landscape
- Scale — 2B users × 1000+ candidate posts each = trillions of scoring ops/day
- Personalisation depth — coarse categories don't work
- Real-time freshness — viral post from 5 min ago must rank now
- Multi-objective conflict — clicks ≠ satisfaction ≠ retention
- Feedback loop risk — model trains on data it generated
| System | Intent | Pool | Objective |
|---|---|---|---|
| Search | Explicit query | Query match | Relevance |
| Ads | Advertiser + context | Targeting | CTR × bid |
| Recsys | Inferred | Full catalog | Engagement |
| Feed | Identity + graph | Social graph | Long-term value |
- ~1000 candidates per feed refresh
- No explicit query — identity IS the query
- Feed has weakest intent signal of 4 systems
- Graph-constrained pool
- Why chronological fails → frequency ≠ relevance
- Why feed ≠ search → no explicit intent
- Why multi-objective needed → click ≠ satisfaction
The Math & Objective Function
- WHO (user): behavioral history, real-time session state, social graph strength, device
- WHAT (content): post type, engagement velocity, content embeddings, author quality
- HOW (context): time of day, device, session depth, geo
- CROSS (most powerful): user↔author affinity, topic match, format preference
- Dense — continuous numbers (post age, like count) → fed directly to net
- Sparse — categorical IDs (user ID, author ID) → must embed first
- Embedding tables at 2B users scale → TB size → distributed param servers
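The dense/sparse split above can be sketched in a few lines. This is a minimal illustration, not the production feature pipeline: the table sizes, the 4-dimensional embeddings, and the helper names `make_embedding_table` and `build_model_input` are all assumptions made for the example.

```python
import random

EMBED_DIM = 4  # tiny for illustration; real tables use far wider vectors


def make_embedding_table(num_ids, dim, seed=0):
    """Hypothetical embedding table: one small random vector per categorical ID."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_ids)]


def build_model_input(dense, sparse_ids, tables):
    """Concatenate dense features with the embedding of each sparse ID."""
    vec = list(dense)  # dense features are passed to the net as-is
    for name, idx in sparse_ids.items():
        vec.extend(tables[name][idx])  # sparse IDs become a table lookup first
    return vec


tables = {
    "user_id": make_embedding_table(10, EMBED_DIM, seed=1),
    "author_id": make_embedding_table(10, EMBED_DIM, seed=2),
}
x = build_model_input(
    dense=[0.5, 120.0],  # e.g. post age in hours, like count
    sparse_ids={"user_id": 3, "author_id": 7},
    tables=tables,
)
print(len(x))  # 2 dense values plus two lookups of EMBED_DIM values each
```

In a trained model the table rows are learned parameters rather than random values; only the lookup-then-concatenate shape is the point here.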
If a user rarely takes action k, then the prediction Y_ijtk ≈ 0 for that user across all posts. That term then contributes almost nothing to V_ijt, so no manual per-user zeroing of weights is needed: the formula personalises itself implicitly.
λ_k — training loss weights: how much each task shapes the shared layers during backprop.
w_k — ranking weights: how much each prediction contributes to V_ijt at serving time.
⚠ Don't confuse these — common interview mistake
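To keep the two weight sets apart, here is a small sketch. It assumes the value function is a weighted sum of per-action predictions, V_ijt = Σ_k w_k · Y_ijtk, which is what the notes on w_k imply. All numeric weights and the action list are invented for illustration.

```python
ACTIONS = ["like", "comment", "share", "hide", "report"]

# Serving-time ranking weights w_k (assumed values); negative signals get negative weights.
W = {"like": 1.0, "comment": 4.0, "share": 6.0, "hide": -20.0, "report": -100.0}

# Training-time loss weights lambda_k (assumed values); a separate set of numbers.
LAMBDA = {"like": 1.0, "comment": 1.0, "share": 1.0, "hide": 2.0, "report": 5.0}


def value_score(y, w=W):
    """V_ijt = sum_k w_k * Y_ijtk, used to order posts at serving time."""
    return sum(w[k] * y[k] for k in ACTIONS)


def multitask_loss(per_task_loss, lam=LAMBDA):
    """L = sum_k lambda_k * L_k, used only during training."""
    return sum(lam[k] * per_task_loss[k] for k in ACTIONS)


# A user who almost never comments or shares: those predictions are near zero,
# so those terms drop out of V without anyone editing w_k for this user.
y_lurker = {"like": 0.20, "comment": 0.001, "share": 0.0, "hide": 0.01, "report": 0.0}
y_active = {"like": 0.20, "comment": 0.05, "share": 0.02, "hide": 0.01, "report": 0.0}

print(value_score(y_lurker), value_score(y_active))
```

Changing `LAMBDA` alters what the shared layers learn and requires retraining. Changing `W` only reorders the feed and can be done at serving time.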
- V_ijt formula — know every subscript
- w_k calibrated via surveys not engagement
- Negative signals (hide, report) enter with negative weights
- Dense → net directly; sparse → embedding lookup first
- Two separate weight sets: λ_k vs w_k
System Architecture
- Actions store — user events (likes, comments, hides). High write throughput, fast point lookups
- Objects store — post content + metadata. Bulk reads of 1000 posts
- Summary store — precomputed running aggregates. Updated in near-real-time. This is what makes real-time engagement features feasible.
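The summary store idea (update a running aggregate on write, read it in constant time at ranking) can be sketched with an in-memory counter. The `SummaryStore` class and its `record`/`get` methods are hypothetical names for this example, not a real storage API.

```python
from collections import defaultdict


class SummaryStore:
    """Toy in-memory stand-in for a summary store (hypothetical interface)."""

    def __init__(self):
        self._counts = defaultdict(lambda: defaultdict(int))

    def record(self, post_id, action):
        # Called on every write to the actions store: O(1) incremental update.
        self._counts[post_id][action] += 1

    def get(self, post_id, action):
        # Called at ranking time: O(1) read, no scan over raw events.
        return self._counts[post_id][action]


store = SummaryStore()
for action in ["like", "like", "comment", "hide", "like"]:
    store.record("post_42", action)

print(store.get("post_42", "like"))  # 3
```

The alternative, recounting raw events per request, would cost one scan per post per feature for each of the ~1000 candidates.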
- Unread bumping — posts ranked in previous session but never seen (user didn't scroll far) → re-eligible. May be higher quality than newer posts.
- Action bumping — posts user already saw that now have significant new activity (conversation broke out in comments) → re-surfaced as "comment-bumped."
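The two bumping rules can be expressed as simple filters over the previous session's ranked list. This is a sketch under assumptions: the function name `find_bumps`, the use of comment counts as the activity signal, and the `min_new_comments` threshold are all invented for illustration.

```python
def find_bumps(prev_ranked, seen, comments_at_view, comments_now, min_new_comments=5):
    """Return (unread_bumps, action_bumps) from the previous session's ranked posts."""
    # Unread bumping: ranked last session but never scrolled into view.
    unread = [p for p in prev_ranked if p not in seen]
    # Action bumping: already seen, but with significant new activity since then.
    action = [
        p for p in prev_ranked
        if p in seen
        and comments_now.get(p, 0) - comments_at_view.get(p, 0) >= min_new_comments
    ]
    return unread, action


prev_ranked = ["a", "b", "c", "d"]
seen = {"a", "b"}
unread, action = find_bumps(
    prev_ranked,
    seen,
    comments_at_view={"a": 2, "b": 10},
    comments_now={"a": 3, "b": 40},
)
print(unread, action)  # ['c', 'd'] ['b']
```

Both lists would then be added back to the candidate pool for the new session.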
- Integrity first → don't learn from engagement on policy-violating content
- Pass 0 → cost gate before expensive Pass 1. High recall, not precision
- Pass 1 independent → scores per post, no list context yet
- Pass 2 separate → diversity is list-level, can't do in isolation
- Summary store = precomputed running aggregates (not recount per request)
- Unread bumping + action bumping — know both mechanisms
- Pass 0: 1000→500, high recall goal, cheap model
- Pass 1: independent scoring, full neural net, main personalisation
- Pass 2: list-aware, diversity + dedup, cannot run in isolation
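The pass structure summarised above can be sketched end to end. This is a toy pipeline, not the production one. The field names (`cheap_score`, `violates_policy`), the stand-in `toy_model`, and the per-author cap used as the Pass 2 diversity rule are assumptions for the example.

```python
from collections import Counter


def integrity_filter(posts):
    """Runs first: drop anything that can never be shown."""
    return [p for p in posts if not p["violates_policy"]]


def pass0(posts, keep=500):
    """Cheap cost gate: keep the top `keep` posts by a lightweight score."""
    return sorted(posts, key=lambda p: p["cheap_score"], reverse=True)[:keep]


def pass1(posts, full_model):
    """Independent scoring: each post is scored without looking at the others."""
    return [dict(p, score=full_model(p)) for p in posts]


def pass2(scored, max_per_author=2):
    """List-aware pass: dedup and cap posts per author while walking the ranking."""
    feed, per_author, seen_ids = [], Counter(), set()
    for p in sorted(scored, key=lambda p: p["score"], reverse=True):
        if p["id"] in seen_ids or per_author[p["author"]] >= max_per_author:
            continue
        seen_ids.add(p["id"])
        per_author[p["author"]] += 1
        feed.append(p)
    return feed


candidates = [
    {"id": i, "author": i % 50, "cheap_score": ((i * 37) % 1000) / 1000,
     "violates_policy": i % 100 == 0}
    for i in range(1000)
]
toy_model = lambda p: 0.5 * p["cheap_score"] + 0.1 * (p["author"] % 5)

survivors = pass0(integrity_filter(candidates))
feed = pass2(pass1(survivors, toy_model))
print(len(candidates), len(survivors), len(feed))
```

Pass 2 needs the whole scored list because the decision to keep a post depends on which posts were already placed above it. That is why it cannot be folded into Pass 1.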
ML Models Deep Dive
- Label sparsity — "report" has few positives. Shared layers trained on union of all labels. Report head benefits from abundant like/comment signal.
- Shared structure — base user taste signal is shared across all tasks. Training K separate models relearns it K times wastefully.
- Inference cost — one shared forward pass, not K separate passes at serving time.
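The "one shared forward pass, K cheap heads" argument can be made concrete with a tiny shared-bottom model. This is a sketch with random untrained weights. The class name `SharedBottomMTL`, the layer sizes, and the `shared_calls` counter are all assumptions for illustration.

```python
import math
import random


def dense(x, W, b):
    """Plain fully connected layer: W @ x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def init(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class SharedBottomMTL:
    """Shared bottom layer plus one small head per task (untrained, illustrative)."""

    def __init__(self, in_dim, hidden, tasks, seed=0):
        rng = random.Random(seed)
        self.W_shared = init(hidden, in_dim, rng)
        self.b_shared = [0.0] * hidden
        self.heads = {t: (init(1, hidden, rng), [0.0]) for t in tasks}
        self.shared_calls = 0  # counts how often the expensive shared part runs

    def predict(self, x):
        self.shared_calls += 1
        h = relu(dense(x, self.W_shared, self.b_shared))  # computed once
        # Each head reuses h, so adding a task adds only one small layer.
        return {t: sigmoid(dense(h, W, b)[0]) for t, (W, b) in self.heads.items()}


model = SharedBottomMTL(in_dim=3, hidden=4, tasks=["like", "comment", "report"])
preds = model.predict([0.1, 0.5, -0.2])
print(sorted(preds), model.shared_calls)
```

During training, gradients from every head flow into `W_shared`. That is how a sparse label such as "report" benefits from the abundant like and comment signal.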
- User embeddings — behavioral taste profile. Similar users → close vectors.
- Content embeddings — semantic post meaning (text/image/video). Similar topics → close vectors.
- Author/page embeddings — content style + topic distribution.
- Two-tower — user + content in same space. Dot product = affinity.
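As a sketch of the two-tower idea, each tower below is reduced to a single linear map into a shared 2-dimensional space, and affinity is the dot product of the two outputs. The weight matrices, feature vectors, and post names are made up for the example. Real towers are deep networks trained jointly.

```python
def linear_tower(weights, features):
    """Stand-in for a tower: maps raw features into the shared embedding space."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def affinity(user_vec, content_vec):
    """Dot product in the shared space."""
    return sum(a * b for a, b in zip(user_vec, content_vec))


USER_W = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]  # 3 user features -> 2 dims
CONTENT_W = [[0.5, 1.0], [1.0, -1.0]]         # 2 content features -> 2 dims

user_vec = linear_tower(USER_W, [1.0, 0.0, 0.0])
posts = {"running_tips": [1.0, 1.0], "stock_news": [-1.0, 0.5]}
content_vecs = {name: linear_tower(CONTENT_W, f) for name, f in posts.items()}

ranked = sorted(posts, key=lambda n: affinity(user_vec, content_vecs[n]), reverse=True)
print(ranked)
```

Because the content vectors do not depend on the user, they can be computed offline and indexed. Only the user tower has to run at request time.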
- 2B users × 128 dims × 4 bytes = ~1TB for user table alone
- Can't sit on single machine → distributed parameter servers or sharded in-memory caches
- Embedding lookups are often the bottleneck at serving time
- Learned end-to-end via backprop through the ranking model
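The table-size arithmetic above is easy to check. The 64 GB of usable memory per host is an assumption picked only to show how a shard count falls out of the total.

```python
def table_bytes(num_rows, dim, bytes_per_value=4):
    """Raw size of a float32 embedding table, ignoring index and replication overhead."""
    return num_rows * dim * bytes_per_value


size = table_bytes(num_rows=2_000_000_000, dim=128)
print(size / 10**12)  # size in decimal terabytes

HOST_BYTES = 64 * 10**9  # assumed usable memory per host
min_hosts = -(-size // HOST_BYTES)  # ceiling division
print(min_hosts)
```

This is a lower bound on the host count. Replication for availability and read throughput multiplies it.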
- Offline (batch) — train periodically on historical data. Stable, debuggable. Weakness: model is always stale.
- Online — weights updated continuously. Fresh but unstable, hard infra.
- Meta's approach: offline training + real-time feature updates. Get freshness from features, not weight updates.
Candidate Generation & Filtering
- Retrieval — recall-focused, fast, approximate. Can afford misses but not slowness.
- Ranking — precision-focused, expensive, exact. Adjudicates between retrieved candidates.
- ANN = approximate nearest neighbor. Trades a small recall loss for a massive speedup (from an O(N) brute-force scan to roughly O(log N) for graph indexes such as HNSW).
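To show the recall-for-speed trade in code, here is random-hyperplane LSH, one of the simplest ANN schemes. It is not the graph-based approach (HNSW) that gives the roughly logarithmic search mentioned above. It is used here only because it fits in a few lines of standard-library Python. The class name and parameters are assumptions.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class HyperplaneLSH:
    """Single-table random-hyperplane LSH: similar vectors tend to share a bucket."""

    def __init__(self, dim, num_bits=8, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]
        self.buckets = {}
        self.vectors = {}

    def _key(self, v):
        # One bit per hyperplane: which side of the plane the vector falls on.
        return tuple(dot(p, v) >= 0 for p in self.planes)

    def add(self, item_id, v):
        self.vectors[item_id] = v
        self.buckets.setdefault(self._key(v), []).append(item_id)

    def query(self, q, k=5):
        # Only the query's bucket is scanned, so true neighbours in other
        # buckets are missed: that is the recall loss.
        candidates = self.buckets.get(self._key(q), [])
        return sorted(candidates, key=lambda i: dot(q, self.vectors[i]), reverse=True)[:k]


rng = random.Random(42)
vectors = {i: [rng.gauss(0, 1) for _ in range(8)] for i in range(200)}
index = HyperplaneLSH(dim=8)
for item_id, v in vectors.items():
    index.add(item_id, v)

hits = index.query(vectors[7], k=len(vectors))
print(len(hits), "of", len(vectors), "vectors scanned")
```

Production systems typically query several hash tables or use a graph or quantisation index to recover most of the lost recall.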
- Removes spam, misinfo, hate speech, policy violations
- Must run before ranking: if policy-violating content reaches users and gets engagement, it enters the training data and the model learns to rank violations highly
- Also: don't waste ranking compute on content that can never be shown
- Integrity classifiers have their own recall/precision tradeoffs
- Goal: high recall, not high precision
- A missed great post at Pass 0 is invisible — the ranker never gets a chance
- A mediocre post that sneaks through Pass 0 gets demoted at Pass 1 — recoverable
- Threshold set conservatively. Better to keep too many than drop too many.
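One way to set a conservative Pass 0 threshold is to pick the highest cutoff that still keeps a target fraction of posts known to be good, for example posts the full ranker later placed near the top. This is a sketch under that assumption. The function name, the labels, and the scores are invented.

```python
import math


def threshold_for_recall(cheap_scores, is_good, target_recall=0.99):
    """Highest cheap-score cutoff that keeps at least `target_recall` of good posts."""
    good = sorted((s for s, g in zip(cheap_scores, is_good) if g), reverse=True)
    if not good:
        raise ValueError("need at least one known-good post")
    need = math.ceil(target_recall * len(good))
    return good[need - 1]  # keep every post scoring >= this value


cheap_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_good = [True, True, False, True, False, False, True, False, False, False]

strict = threshold_for_recall(cheap_scores, is_good, target_recall=0.75)
safe = threshold_for_recall(cheap_scores, is_good, target_recall=1.0)
kept_strict = [s for s in cheap_scores if s >= strict]
kept_safe = [s for s in cheap_scores if s >= safe]
print(strict, len(kept_strict), safe, len(kept_safe))
```

The safer threshold lets more mediocre posts through, which costs Pass 1 compute but loses no good post. That is the trade the bullets above argue for.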
- Ranker quality bounded by candidate pool quality
- Two-tower: user tower + content tower → same embedding space → ANN
- ANN tooling: FAISS and ScaNN (libraries), HNSW (graph index); roughly O(log N) search instead of O(N)
- Integrity runs before ranking (not after) — prevents training contamination
- Pass 0: recall over precision — false negatives are invisible
Training, Evaluation & Feedback Loops
- Position bias — higher positions get more engagement purely from position. Fix: IPS (reweight each logged example by 1/P(examined at position k), the inverse of its examination propensity) or randomisation experiments.
- Selection bias — only observe engagement on shown content. Fix: randomisation experiments (random ordering for small % of users = unbiased labels).
- Conformity bias — social proof inflates already-popular content. Fix: use engagement rate not raw count; normalise by exposure.
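Two of the fixes above are short enough to sketch: inverse propensity weighting for position bias, and exposure-normalised engagement rate for conformity bias. The propensity values and the clipping constant are assumptions. In practice propensities are estimated, for example from the randomisation experiments mentioned above.

```python
# Assumed probability that a user examines each feed position.
PROPENSITY = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.25}


def ips_weight(position, propensity=PROPENSITY, clip=10.0):
    """Inverse propensity weight, clipped to limit variance from rare positions."""
    return min(1.0 / propensity[position], clip)


def ips_weighted_clicks(logs):
    """Sum of IPS-weighted clicks per post from (post_id, position, clicked) rows."""
    totals = {}
    for post_id, position, clicked in logs:
        if clicked:
            totals[post_id] = totals.get(post_id, 0.0) + ips_weight(position)
    return totals


def engagement_rate(engagements, impressions):
    """Exposure-normalised engagement, instead of a raw count."""
    return engagements / impressions if impressions else 0.0


logs = [("p1", 1, True), ("p2", 4, True), ("p2", 4, False), ("p1", 1, True)]
print(ips_weighted_clicks(logs))
print(engagement_rate(50, 1000), engagement_rate(500, 100000))
```

A click at position 4 counts four times as much as a click at position 1 here, because the post had only a quarter of the chance of being examined.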
Evolution Since 2021 — State of the Art
The 2021 engineering post described a system designed around a social graph and a multi-pass ranking pipeline. What has changed since, and how has the broader field moved? This section traces the evolution at Meta and across the industry: TikTok's interest-graph challenge, foundation model integration, and real-time infrastructure.
| Axis | TikTok | Meta (2021→now) |
|---|---|---|
| Candidate source | Pure interest graph | Hybrid social+interest |
| Label quality | Watch time (clean, low social friction) | Likes (socially confounded) |
| Content format | Homogeneous short video | Heterogeneous (text/photo/video/reels) |
| Cold start | Random small sample → observe watch rate | Author history + content embeddings |
| Social graph | Not required | Primary → one signal among many |
- Content understanding (upstream) — unified multimodal embeddings (text+image+video in one space) replace separate task-specific classifiers. Cross-modal understanding.
- Semantic interest expansion — LLMs connect "marathon training" to "endurance nutrition" without explicit behavioral signal.
- Offline feature generation — LLM runs offline on all content → structured features (topic, tone, intent) → cached in X_ijt. No LLM at serving time.
- LLM re-ranking (emerging) — too expensive for all traffic, being explored for top-K slots only.
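- Offline store — batch-computed, historical, for training. Joins must be point-in-time correct (feature values as of the event time, no leakage from the future)
- Online store — low-latency serving of the latest feature values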
- Consistency — same feature logic for both → eliminates train-serve skew
- Streaming — Kafka → Flink → online store → sub-second freshness
- Examples: Uber Michelangelo, Airbnb Zipline, Feast (OSS)
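The consistency point can be illustrated by routing both the batch path and the streaming path through one feature definition. This is a toy sketch. The function and class names are hypothetical and do not correspond to the API of any feature store listed above.

```python
def like_rate(likes, impressions):
    """Single feature definition shared by the offline and online paths."""
    return likes / impressions if impressions else 0.0


def offline_like_rate(events, as_of):
    """Batch path: point-in-time correct, only events with ts <= as_of are visible."""
    likes = sum(1 for e in events if e["ts"] <= as_of and e["type"] == "like")
    imps = sum(1 for e in events if e["ts"] <= as_of and e["type"] == "impression")
    return like_rate(likes, imps)


class OnlineLikeRate:
    """Streaming path: counters updated per event, read at serving time."""

    def __init__(self):
        self.likes = 0
        self.impressions = 0

    def update(self, event):
        if event["type"] == "like":
            self.likes += 1
        elif event["type"] == "impression":
            self.impressions += 1

    def value(self):
        return like_rate(self.likes, self.impressions)


events = [
    {"ts": 1, "type": "impression"},
    {"ts": 2, "type": "like"},
    {"ts": 3, "type": "impression"},
    {"ts": 4, "type": "like"},
]

online = OnlineLikeRate()
for e in events:
    if e["ts"] <= 3:  # state of the stream at serving time ts=3
        online.update(e)

print(online.value(), offline_like_rate(events, as_of=3))
```

If the two paths used separately written formulas, any divergence (for example, a different zero-impression default) would surface as train-serve skew.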
- MLP (2021) — implicit feature interactions via deep layers
- DCN-v2 (2022) — explicit cross layers for pairwise feature interactions
- MoE multitask (2023) — learned expert routing. Similar tasks share experts; divergent tasks use different experts.
- Scaled models (2024) — billions of parameters. Ranking quality is reported to follow scaling laws (it keeps improving as models grow). Constraint = inference latency, not training compute.
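The difference between implicit and explicit feature interactions is easiest to see in a single DCN-v2 cross layer, which computes x_{l+1} = x_0 ⊙ (W·x_l + b) + x_l. The sketch below uses hand-picked weights so the pairwise product is visible. The function name and the values are illustrative only.

```python
def cross_layer(x0, xl, W, b):
    """One DCN-v2 cross layer: x0 * (W @ xl + b) + xl, with * taken elementwise."""
    wx = [sum(w * x for w, x in zip(row, xl)) + bi for row, bi in zip(W, b)]
    return [a * c + d for a, c, d in zip(x0, wx, xl)]


x0 = [1.0, 2.0]
SWAP = [[0.0, 1.0], [1.0, 0.0]]  # hand-picked so each output mixes both inputs
ZERO_BIAS = [0.0, 0.0]

x1 = cross_layer(x0, x0, SWAP, ZERO_BIAS)
print(x1)  # [3.0, 4.0]: each entry is x0[0] * x0[1] plus the residual term
```

An MLP would have to approximate the product `x0[0] * x0[1]` through its nonlinearities. The cross layer produces it directly, and stacking layers raises the interaction order.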
- EU Digital Services Act → algorithmic transparency
- "Why am I seeing this?" UI → must be traceable
- Chronological feed option (Instagram) → user choice
- Fairness constraints → no demographic/political bias
- Architectural implication: pure black-box NN → hybrid interpretable + neural
Design Thinking
- Multi-pass architecture
- Multitask learning + why
- Basic features + embeddings
- A/B testing as evaluation method
- Position bias as a problem
- Start with objective function, motivate every term
- Survey-based label calibration (not just clicks)
- IPS + randomisation for debiasing
- Two-tower retrieval + ANN indexing
- Multi-source candidate fusion rationale
- All 5 feedback loop pathologies + mitigations
- Guardrail metrics + why they gate not just measure
- Train-serve skew + feature stores
- Architecture evolution opinions (MoE, DCN, scaling)
- Connect to business: ads revenue, regulatory
- TikTok effect + foundation models evolution
- Trap 1: Jumping to architecture before defining objective. Fix: "Before I get to the model, let me define what we're optimising for."
- Trap 2: Treating candidate generation as trivial ("just fetch social graph posts"). Fix: Multi-source + two-tower discussion.
- Trap 3: Saying "use a neural network" without specifying architecture. Fix: MTL with shared bottom + task heads, or DCN-v2, or MoE — justify.
- Trap 4: Only mentioning positive signals (like, comment, share). Fix: Always mention hide, report, unfollow as negative signals with negative weights.
- Trap 5: Treating offline evaluation as sufficient. Fix: "Offline metrics have a selection bias problem — held-out data reflects the old ranking policy."
- Trap 6: Ignoring the feedback loop. Fix: Proactively mention pathologies + exploration budget + long-term holdouts.
- Trap 7: Not connecting to scale. Fix: Ground every architecture choice in the scale constraints from your clarifying questions. Parallelism, embedding serving, latency.