ML Systems · Feed Ranking

Facebook News Feed Ranking

Notes on Meta's 2021 engineering post on Feed ranking — a rare public description of how a social-graph-constrained, multi-objective ranking system works at 2B-user scale. Part I traces the system as published; Part II covers what the field has learned since.

Source: engineering.fb.com/2021 · Extended 2021→2026 · Published May 2026
Legend: Must know cold · Derive on the fly · Revision anchor · Key insight
LAYER 01

The Problem & Landscape

⚡ 60-SECOND SUMMARY Feed ranking is an information filtering + ordering problem. Given a user and ~1000 candidate posts at time t, find the ordering that maximises long-term value. Hard because: no explicit query (user identity IS the query), multi-objective tension (engagement ≠ value), real-time freshness demands, 2B+ user scale, and dangerous feedback loops. Fundamentally different from search (feed has no query), ads (feed has no direct revenue signal), and recsys (feed draws from a graph-constrained pool, not a full catalog).
🎯 Relevance: Does content match what this person cares about?
Engagement: Will they click/like/comment? Measurable but a noisy proxy.
🌱 Long-term Value: Does this make their life better? Do they come back tomorrow?
THESE DIVERGE AT THE EXTREMES — outrage maximises engagement, destroys value
5 Core Tensions
  • Scale — 2B users × 1000+ posts = trillions of ops/day
  • Personalisation depth — coarse categories don't work
  • Real-time freshness — viral post from 5 min ago must rank now
  • Multi-objective conflict — clicks ≠ satisfaction ≠ retention
  • Feedback loop risk — model trains on data it generated
Feed vs The Cousins
System | Intent | Pool | Objective
Search | Explicit query | Query match | Relevance
Ads | Advertiser + context | Targeting | CTR × bid
Recsys | Inferred | Full catalog | Engagement
Feed | Identity + graph | Social graph | Long-term value
COLD
Must Know Cold
  • ~1000 candidates per feed refresh
  • No explicit query — identity IS the query
  • Feed has weakest intent signal of 4 systems
  • Graph-constrained pool
DERIVE
Derive On the Fly
  • Why chronological fails → recency and posting frequency ≠ relevance
  • Why feed ≠ search → no explicit intent
  • Why multi-objective needed → click ≠ satisfaction
INTERVIEW
Interview Gold
Q: How is feed different from a recommendation system?
Graph-constrained pool + weakest intent signal + value vs engagement objective. Don't just say "it's personalised."
LAYER 02

The Math & Objective Function

⚡ 60-SECOND SUMMARY One formula governs everything: V_ijt = Σ w_k · Y_ijtk. Make K predictions (one per action type), weight them by survey-calibrated importance, sum to a scalar. Two key properties: rare actions self-suppress (Y → 0 automatically), weights can be personalised per user. Features X_ijt live in 3 buckets: user (WHO), content (WHAT), context (HOW) — with cross-bucket interaction features being most predictive.
V_ijt = Σ_k [ w_ijtk · Y_ijtk(X_ijt) ]
i = post    j = user    t = time    k = action type (like, comment, share, hide…)
Y_ijtk = predicted probability of action k  ←  output of model head k
w_ijtk = weight of action k  ←  calibrated via SURVEYS, not just engagement data
X_ijt = feature vector  ←  WHO × WHAT × HOW + cross-features
Building Up: Single → Multi Signal
# Step 1: naive single signal
V = P(like)    ← fails: likes ≠ value

# Step 2: multi signal
Y = [P(like), P(comment), P(share), P(hide), P(report), P(full_watch)]

# Step 3: weighted sum (Y4 = hide and Y5 = report are subtracted)
V_ijt = w1·Y1 + w2·Y2 + w3·Y3 - w4·Y4 - w5·Y5 + w6·Y6    ← negatives!
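The weighted sum in Step 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Meta's implementation: the action names, probabilities, and weight values are all made up for the example. Only the sign convention (negative weights for hide and report) comes from these notes.

```python
# Hypothetical ranking weights w_k; the values are illustrative only.
# Negative actions (hide, report) carry negative weights.
RANKING_WEIGHTS = {
    "like": 1.0,
    "comment": 4.0,
    "share": 6.0,
    "hide": -10.0,
    "report": -50.0,
}


def value_score(predictions, weights=RANKING_WEIGHTS):
    """Compute V = sum_k w_k * Y_k for one (post, user, time) triple.

    `predictions` maps action name -> predicted probability Y_k.
    Actions without a prediction contribute nothing.
    """
    return sum(weights[k] * predictions.get(k, 0.0) for k in weights)


# Two made-up posts scored for the same user.
calm_post = {"like": 0.20, "comment": 0.02, "share": 0.01, "hide": 0.001, "report": 0.0001}
bait_post = {"like": 0.30, "comment": 0.05, "share": 0.02, "hide": 0.050, "report": 0.0100}

v_calm = value_score(calm_post)
v_bait = value_score(bait_post)
```

With these invented numbers the bait post wins on every positive signal yet ends up with the lower V, because its hide and report probabilities are multiplied by large negative weights.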
3 Feature Buckets + Cross
  • WHO (user): behavioral history, real-time session state, social graph strength, device
  • WHAT (content): post type, engagement velocity, content embeddings, author quality
  • HOW (context): time of day, device, session depth, geo
  • CROSS (most powerful): user↔author affinity, topic match, format preference
Cross-features can't be precomputed — depend on both user AND post simultaneously
Dense vs Sparse Features
  • Dense — continuous numbers (post age, like count) → fed directly to net
  • Sparse — categorical IDs (user ID, author ID) → must embed first
  • Embedding tables at 2B users scale → TB size → distributed param servers
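The dense/sparse split can be sketched as follows. This is a toy illustration: the table contents, the 4-dimensional embedding size, and the ID strings are assumptions chosen to keep the example small, and a real table would be learned rather than randomly filled.

```python
import random

EMBED_DIM = 4  # tiny for illustration; production tables are far wider

random.seed(0)
# Hypothetical embedding table: one vector per author ID.
author_table = {
    author_id: [random.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
    for author_id in ("author_17", "author_42")
}
UNKNOWN = [0.0] * EMBED_DIM  # fallback for IDs missing from the table


def build_input(dense, author_id):
    """Concatenate dense features with the embedded sparse ID."""
    embedded = author_table.get(author_id, UNKNOWN)
    return list(dense) + embedded


# Dense features: post age in hours and like count (normalisation omitted).
x = build_input(dense=[0.5, 120.0], author_id="author_42")
```

The dense values pass through unchanged, while the categorical ID is replaced by its vector before the two are joined into the network input.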
Self-Suppression Property

If user rarely does action k, then Y_ijtk ≈ 0 for that user across all posts. That term contributes near-zero to V automatically — no manual per-user weight zeroing needed. The formula personalises itself implicitly.

Two Separate Weight Sets

λ_k — training loss weights: how much each task shapes the shared layers during backprop.

w_k — ranking weights: how much each prediction contributes to V_ijt at serving time.

⚠ Don't confuse these — common interview mistake
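A small sketch can make the separation concrete. All numeric values here are hypothetical; the point is only that λ_k appears in the training loss while w_k appears in the serving score, and that the two dictionaries never need to match.

```python
import math


def binary_cross_entropy(p, label):
    """Log loss for one prediction; p is clipped to avoid log(0)."""
    p = min(max(p, 1e-7), 1 - 1e-7)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


# Hypothetical values for both weight sets.
LOSS_WEIGHTS = {"like": 1.0, "comment": 2.0, "hide": 5.0}       # lambda_k: training only
RANKING_WEIGHTS = {"like": 1.0, "comment": 4.0, "hide": -10.0}  # w_k: serving only


def training_loss(preds, labels):
    """Total loss = sum_k lambda_k * L_k. Non-negative by construction."""
    return sum(
        LOSS_WEIGHTS[k] * binary_cross_entropy(preds[k], labels[k])
        for k in LOSS_WEIGHTS
    )


def serving_score(preds):
    """V = sum_k w_k * Y_k. Can be negative because w_hide < 0."""
    return sum(RANKING_WEIGHTS[k] * preds[k] for k in RANKING_WEIGHTS)


preds = {"like": 0.3, "comment": 0.05, "hide": 0.2}
labels = {"like": 1, "comment": 0, "hide": 0}
loss = training_loss(preds, labels)
score = serving_score(preds)
```

Note that λ_hide is positive (the hide head still needs to learn) while w_hide is negative (a likely hide should push the post down).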

COLD
  • V_ijt formula — know every subscript
  • w_k calibrated via surveys not engagement
  • Negative signals (hide, report) enter with negative weights
  • Dense → net directly; sparse → embedding lookup first
  • Two separate weight sets: λ_k vs w_k
INTERVIEW
Q: Walk me through your objective function design.
Start with single signal → motivate failure → build to multi-signal → mention survey calibration of weights. Almost no candidates know this. Then mention the self-suppression property.
LAYER 03

System Architecture

⚡ 60-SECOND SUMMARY Two-layer system: thin PHP/Web frontend (just routes) + smart Feed Aggregator backend (all intelligence). Three storage types: actions store, objects store, summary store (precomputed running aggregates = key for real-time). The aggregator runs 4 sequential passes: integrity filter → Pass 0 (1000→500, cheap, high recall) → Pass 1 (full neural net, independent scoring, main personalisation) → Pass 2 (list-aware diversity + dedup). Parallel predictors make scoring 1000 posts feasible in real time.
DEVICE: iOS/Android/Web (sends request)
FRONTEND: PHP/Web Layer (thin router)
BACKEND: Feed Aggregator (all intelligence)
STORAGE: Feed Leaf DBs (actions/objects/summary)
The 4-Pass Pipeline — Input → Process → Output
GATE · Integrity: spam/misinfo/policy → hard remove
PASS 0 · Fast Filter: 1000 → ~500 · lightweight model · HIGH RECALL
PASS 1 · Main Scoring: 500 posts · full neural net · V_ijt per post · parallel
PASS 2 · Contextual: list-aware · diversity rules · dedup
OUTPUT · Top N posts rendered on device
3 Storage Types
  • Actions store — user events (likes, comments, hides). High write throughput, fast point lookups
  • Objects store — post content + metadata. Bulk reads of 1000 posts
  • Summary store — precomputed running aggregates. Updated in near-real-time. This is what makes real-time engagement features feasible.
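The idea behind the summary store can be sketched as a counter that is updated on the write path. This is a toy in-memory stand-in, assuming a simple per-post, per-action count; the class and method names are hypothetical and say nothing about how Meta's store is actually built.

```python
from collections import defaultdict


class SummaryStore:
    """Toy in-memory stand-in for a store of running aggregates."""

    def __init__(self):
        self._counts = defaultdict(lambda: defaultdict(int))

    def record(self, post_id, action):
        # Constant-time update as each event arrives.
        self._counts[post_id][action] += 1

    def read(self, post_id, action):
        # Constant-time read at ranking time; no scan over the raw event log.
        return self._counts[post_id][action]


store = SummaryStore()
for action in ["like", "like", "comment", "like", "hide"]:
    store.record("post_1", action)

likes = store.read("post_1", "like")
```

The ranking request never recounts events: it reads a number that was already maintained incrementally, which is what keeps engagement features cheap enough for real-time use.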
2 Bumping Mechanisms
  • Unread bumping — posts ranked in previous session but never seen (user didn't scroll far) → re-eligible. May be higher quality than newer posts.
  • Action bumping — posts user already saw that now have significant new activity (conversation broke out in comments) → re-surfaced as "comment-bumped."
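The two bumping rules can be written as a single eligibility check. This is a sketch under assumptions: the field names are hypothetical, and the threshold of 5 new comments is invented to stand in for "significant new activity", which the notes do not quantify.

```python
from dataclasses import dataclass


@dataclass
class PostState:
    post_id: str
    was_ranked: bool         # appeared in a previously ranked session
    was_seen: bool           # the user actually scrolled to it
    comments_when_seen: int  # comment count at last view (0 if never seen)
    comments_now: int


def bump_reason(post, min_new_comments=5):
    """Return 'unread', 'action', or None. The threshold is an assumption."""
    if post.was_ranked and not post.was_seen:
        return "unread"
    if post.was_seen and post.comments_now - post.comments_when_seen >= min_new_comments:
        return "action"
    return None


unread = bump_reason(PostState("a", True, False, 0, 3))
active = bump_reason(PostState("b", True, True, 2, 10))
stale = bump_reason(PostState("c", True, True, 2, 4))
```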
Why Each Pass Exists
  • Integrity first → don't learn from violated content engagement
  • Pass 0 → cost gate before expensive Pass 1. High recall, not precision
  • Pass 1 independent → scores per post, no list context yet
  • Pass 2 separate → diversity is list-level, can't do in isolation
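The pass structure above can be sketched end to end on synthetic data. Everything numeric is an assumption: the scores are stored numbers standing in for model outputs, and the "at most 2 posts per author" rule is one simple example of a list-level diversity constraint, not the rule Meta uses.

```python
def integrity_gate(posts):
    """Hard-remove anything flagged as violating policy."""
    return [p for p in posts if not p["violates_policy"]]


def pass0(posts, keep):
    """Cheap, recall-oriented cut using a lightweight score."""
    return sorted(posts, key=lambda p: p["cheap_score"], reverse=True)[:keep]


def pass1(posts):
    """Independent per-post scoring; a stored number stands in for the full model."""
    return [dict(p, value=p["full_score"]) for p in posts]


def pass2(posts, top_n, max_per_author=2):
    """List-aware step: walk posts by value and cap posts per author."""
    seen, out = {}, []
    for p in sorted(posts, key=lambda q: q["value"], reverse=True):
        if seen.get(p["author"], 0) < max_per_author:
            out.append(p)
            seen[p["author"]] = seen.get(p["author"], 0) + 1
        if len(out) == top_n:
            break
    return out


def rank_feed(candidates, keep=500, top_n=20):
    return pass2(pass1(pass0(integrity_gate(candidates), keep)), top_n)


# Synthetic candidates: 1000 posts from 10 authors, every 50th one flagged.
candidates = [
    {"id": i, "author": i % 10, "cheap_score": i % 97,
     "full_score": (i * 7) % 101, "violates_policy": i % 50 == 0}
    for i in range(1000)
]
feed = rank_feed(candidates)
```

Pass 1 only ever looks at one post at a time, so the author cap has to live in Pass 2, where the whole list is visible.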
COLD
  • Summary store = precomputed running aggregates (not recount per request)
  • Unread bumping + action bumping — know both mechanisms
  • Pass 0: 1000→500, high recall goal, cheap model
  • Pass 1: independent scoring, full neural net, main personalisation
  • Pass 2: list-aware, diversity + dedup, cannot run in isolation
INTERVIEW
Q: How do you make scoring fast enough at scale?
Three answers: Pass 0 pruning (halve the pool before expensive models), parallel predictors (posts scored simultaneously across machines), embedding caches (don't re-fetch TB-scale tables per request).
LAYER 04

ML Models Deep Dive

⚡ 60-SECOND SUMMARY Core architecture: Multitask MLP — shared bottom layers learn the (user, post) representation once; task-specific heads specialise per action type. Solves label sparsity, shared structure waste, and inference cost simultaneously. Sparse features (IDs) go through embedding tables (TB scale, distributed param servers) before entering the net. Meta uses offline batch training + real-time feature updates via summary store — pragmatic middle ground between staleness and instability. Cold start handled by content embeddings + author history + engagement velocity.
Multitask Architecture
# Input
dense_features ─────────┐
                        ├──► concat ──► Shared MLP (3-5 layers)
sparse_IDs ──► embed ───┘                      │
                                       ┌───────┴───────┐
                                  task heads (1-2 layers each)
                                  ├── P(like)     head_1
                                  ├── P(comment)  head_2
                                  ├── P(share)    head_3
                                  └── P(hide)     head_4  ← negative

Total Loss = λ1·L1 + λ2·L2 + λ3·L3 + λ4·L4
Why Multitask? (3 problems solved)
  • Label sparsity — "report" has few positives. Shared layers trained on union of all labels. Report head benefits from abundant like/comment signal.
  • Shared structure — base user taste signal is shared across all tasks. Training K separate models relearns it K times wastefully.
  • Inference cost — one shared forward pass, not K separate passes at serving time.
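A shared-bottom multitask forward pass can be sketched in plain Python. This is an untrained toy: layer sizes, random weights, and the 6-value input are assumptions, and a real system would use a tensor library. It only shows the shape of the computation, with the shared layers run once and each head reusing the result.

```python
import math
import random

random.seed(0)


def dense_layer(n_in, n_out):
    """Random weights and zero biases for one fully connected layer."""
    w = [[random.gauss(0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def forward(layer, x, activation):
    w, b = layer
    return [activation(sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(w, b)]


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


TASKS = ["like", "comment", "share", "hide"]
shared = [dense_layer(6, 8), dense_layer(8, 8)]  # shared bottom (2 layers here)
heads = {t: dense_layer(8, 1) for t in TASKS}    # one small head per action


def predict(x):
    """One shared forward pass, then K cheap heads on the same hidden vector."""
    h = x
    for layer in shared:
        h = forward(layer, h, relu)
    return {t: forward(heads[t], h, sigmoid)[0] for t in TASKS}


probs = predict([0.1, 0.7, 0.0, 1.0, 0.3, 0.5])
```

Serving K separate models would repeat the shared layers K times; here they run once and only the small heads differ per task.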
Embedding Types
  • User embeddings — behavioral taste profile. Similar users → close vectors.
  • Content embeddings — semantic post meaning (text/image/video). Similar topics → close vectors.
  • Author/page embeddings — content style + topic distribution.
  • Two-tower — user + content in same space. Dot product = affinity.
Embedding at Scale
  • 2B users × 128 dims × 4 bytes = ~1TB for user table alone
  • Can't sit on single machine → distributed parameter servers or sharded in-memory caches
  • Embedding lookups are often the bottleneck at serving time
  • Learned end-to-end via backprop through the ranking model
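The sizing arithmetic above is easy to check, and the same numbers show why sharding is needed. The 64-shard count and the modulo routing are assumptions for illustration; real systems may hash IDs differently.

```python
def table_bytes(num_ids, dim, bytes_per_value=4):
    """Memory for a dense float32 embedding table."""
    return num_ids * dim * bytes_per_value


def owner_shard(entity_id, num_shards):
    """Route an integer ID to a shard by modulo (a simplifying assumption)."""
    return entity_id % num_shards


user_table = table_bytes(2_000_000_000, 128)  # 1_024_000_000_000 bytes
terabytes = user_table / 1e12                 # about 1.02 TB
per_shard_gb = user_table / 64 / 1e9          # 16 GB per shard across 64 shards
```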
Online vs Offline Learning
  • Offline (batch) — train periodically on historical data. Stable, debuggable. Weakness: model is always stale.
  • Online — weights updated continuously. Fresh but unstable, hard infra.
  • Meta's approach: offline training + real-time feature updates. Get freshness from features, not weight updates.
Cold Start — Two Flavors
NEW USER
No behavioral history → no embeddings. Fallback: demographic priors, popularity signals. Onboarding explicit selection seeds profile.
NEW CONTENT
No engagement history → rely on author history + content embeddings + engagement velocity (fast engagement in first minute = strong signal).
INTERVIEW
Q: How do you handle a post with no engagement history?
Cold start: content embeddings provide semantic signal immediately. Author history as proxy. Engagement velocity once it starts — fast growth in first minute is powerful even at low absolute count.
Q: How do you serve embeddings at 2B-user scale?
Distributed parameter servers or sharded in-memory caches. Embedding lookups are often the serving bottleneck. This shows systems depth beyond just ML.
LAYER 05

Candidate Generation & Filtering

⚡ 60-SECOND SUMMARY Candidate generation is as hard as ranking — ranker quality is bounded by candidate pool quality. Facebook's original social graph traversal has a hard ceiling (only sees your network). Modern systems use multi-source fusion: social graph + two-tower ANN retrieval (interest-based) + trending + collaborative filtering. Two-tower trains user/content encoders with contrastive loss, pre-indexes content in an ANN index (FAISS/ScaNN/HNSW), queries with user embedding at serve time for O(log N) retrieval from full corpus. Integrity filtering runs BEFORE ranking to prevent learning from violated content engagement.
RETRIEVAL FUNNEL
Full corpus
Millions of posts in system
Multi-source candidate gen
~1000 candidates · social graph + two-tower + trending + collab filtering
Pass 0
~500 · lightweight model · high recall
Pass 1 + Pass 2
~500 scored → diversity-adjusted
Rendered
~20-50 posts shown on device
Multi-Source Candidate Fusion
Source 1: Social graph traversal     ~300  → high relevance, network-bounded
Source 2: Two-tower ANN retrieval    ~300  → interest-based, full corpus
Source 3: Trending/viral             ~100  → freshness, popularity signal
Source 4: Collaborative filtering    ~200  → "users like you" signal
Source 5: Re-engagement (bumping)    ~100  → unread + action bumped
                                     ─────
Total:                               ~1000 → Pass 0
Two-Tower Model Mechanics
# Architecture
User features → User Tower    → user_emb (d-dim)
Post features → Content Tower → post_emb (d-dim)
score = dot_product(user_emb, post_emb)

# Training
loss = contrastive(
    pos=(user, engaged_post),   ← high dot product
    neg=(user, random_posts)    ← low dot product
)

# Serving
pre-index all post_emb in ANN index
query(user_emb) → top-K in O(log N)   # FAISS / ScaNN / HNSW
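The scoring, loss, and retrieval steps can be sketched with hand-written vectors standing in for tower outputs. Two assumptions to flag: the notes do not say which contrastive loss is used, so this uses a softmax over one positive and sampled negatives as one common choice; and retrieval here is an exact brute-force scan, which is the result an ANN index approximates rather than an ANN index itself.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(user_emb, pos_emb, neg_embs):
    """Softmax cross-entropy with the engaged post as the correct class."""
    logits = [dot(user_emb, pos_emb)] + [dot(user_emb, n) for n in neg_embs]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[0]


def top_k_exact(user_emb, post_index, k):
    """Brute-force O(N) retrieval; an ANN index approximates this faster."""
    ranked = sorted(post_index,
                    key=lambda pid: dot(user_emb, post_index[pid]),
                    reverse=True)
    return ranked[:k]


# Hand-written 2-d embeddings standing in for tower outputs.
user = [1.0, 0.0]
index = {"running": [0.9, 0.1], "cooking": [0.1, 0.9], "cycling": [0.8, 0.3]}

good = contrastive_loss(user, index["running"], [index["cooking"]])
bad = contrastive_loss(user, index["cooking"], [index["running"]])
nearest = top_k_exact(user, index, k=2)
```

The loss is lower when the engaged post already has the higher dot product, which is the pressure that pulls user and engaged-post vectors together during training.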
Retrieval vs Ranking
  • Retrieval — recall-focused, fast, approximate. Can afford misses but not slowness.
  • Ranking — precision-focused, expensive, exact. Adjudicates between retrieved candidates.
  • ANN = approximate nearest neighbor. Trades a small recall loss for a large speedup: sublinear search (roughly O(log N) for graph indexes such as HNSW) instead of an O(N) scan.
Integrity — Why Before Ranking
  • Removes spam, misinfo, hate speech, policy violations
  • Must run before ranking — if violated content reaches users, gets engagement, enters training data → model learns to rank violations highly
  • Also: don't waste ranking compute on content that can never be shown
  • Integrity classifiers have their own recall/precision tradeoffs
Pass 0 Design Constraint
  • Goal: high recall, not high precision
  • A missed great post at Pass 0 is invisible — the ranker never gets a chance
  • A mediocre post that sneaks through Pass 0 gets demoted at Pass 1 — recoverable
  • Threshold set conservatively. Better to keep too many than to drop a good one.
COLD
  • Ranker quality bounded by candidate pool quality
  • Two-tower: user tower + content tower → same embedding space → ANN
  • ANN tooling: FAISS and ScaNN (libraries), HNSW (graph index). Sublinear search, roughly O(log N), not O(N)
  • Integrity runs before ranking (not after) — prevents training contamination
  • Pass 0: recall over precision — false negatives are invisible
INTERVIEW
Q: How do you surface content from outside a user's network?
Two-tower + ANN retrieval. Interest-graph vs social-graph. Social graph becomes one signal among many rather than the gate. Requires pre-indexing content embeddings and querying with user embedding at serve time.
LAYER 06

Training, Evaluation & Feedback Loops

⚡ 60-SECOND SUMMARY Labels come from two sources: implicit behavioral signals (scale, noisy) + surveys (expensive, high-quality). Three systematic biases corrupt implicit labels: position, selection, conformity. Fix with IPS (inverse propensity scoring) and randomisation experiments. Offline metrics (AUC, NDCG, log loss) are necessary but not sufficient because of selection bias: logged data only covers content the old policy showed, so a better model can look worse offline if it surfaces different content. Online pipeline: offline → shadow → A/B with three metric types (primary/secondary/guardrail) → long-term holdout. Five feedback loop pathologies will silently break the system over time if not proactively mitigated.
Label Sources
IMPLICIT (scale)
strong+: share, save, full watch, long dwell
mod+: like, react, click link
weak+: short dwell, partial watch
weak-: scroll past quickly
strong-: hide post, unfollow, report

EXPLICIT SURVEYS (quality)
"Was this post worth your time?"
"Did this feel meaningful?"
"See more/less like this?"
→ expensive, small-scale, used to calibrate w_k
→ audit whether behavioural gains = perceived gains
3 Systematic Biases + Fixes
  • Position bias: higher positions get more engagement purely from position. Fix: IPS (reweight each logged engagement by 1/propensity, where propensity is the probability that a post at position k is examined) or randomisation experiments.
  • Selection bias — only observe engagement on shown content. Fix: randomisation experiments (random ordering for small % of users = unbiased labels).
  • Conformity bias — social proof inflates already-popular content. Fix: use engagement rate not raw count; normalise by exposure.
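The position-bias fix can be illustrated with a small inverse-propensity estimate. The propensity values are invented for the example (in practice they would be estimated, for instance from a randomisation experiment), and the logs are synthetic.

```python
# Hypothetical examination propensities by feed position.
# Position 1 is examined most often.
PROPENSITY = {1: 1.0, 2: 0.6, 3: 0.4, 4: 0.25}


def naive_rate(logs):
    """Raw engagement rate; `logs` is a list of (position, engaged) pairs."""
    return sum(y for _, y in logs) / len(logs)


def ips_rate(logs):
    """Mean of engaged / propensity over impressions (position-debiased)."""
    return sum(y / PROPENSITY[pos] for pos, y in logs) / len(logs)


# Post A was always shown at the top, post B always at position 4.
post_a = [(1, 1), (1, 0), (1, 1), (1, 0), (1, 1), (1, 0), (1, 1), (1, 0)]
post_b = [(4, 1), (4, 0), (4, 0), (4, 0), (4, 0), (4, 0), (4, 0), (4, 0)]

naive_a, naive_b = naive_rate(post_a), naive_rate(post_b)
ips_a, ips_b = ips_rate(post_a), ips_rate(post_b)
```

On raw counts post A looks four times better than post B. After dividing by the chance each impression was examined at all, the two come out equal, which is the correction the IPS bullet describes.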
COLD 5 FEEDBACK LOOP PATHOLOGIES
PATHOLOGY 01
Popularity Bias Amplification
Popular content shown more → more engagement → higher scores → shown even more. Rich get richer. Niche content never discovered.
Mitigation: exploration budget (multi-armed bandit framing). Deliberately show lower-ranked content to gather signal.
PATHOLOGY 02
Filter Bubbles
User engages with X → model shows more X → feed collapses into ever-narrowing topic space.
Mitigation: content diversity constraints (Pass 2), diversity penalty in objective, explicit user controls.
PATHOLOGY 03
Engagement Bait Adaptation
Creators reverse-engineer objective weights. "Comment YES or NO." "Watch to the end." Signal degrades as creators optimise against the system.
Mitigation: rotate/obscure weights. Dedicated engagement-bait classifiers. Survey-based signals are harder to game.
PATHOLOGY 04
Concept Drift
User interests shift over time. Model over-indexed on long history gets stuck in the past. "You liked cooking 2 years ago → still showing cooking."
Mitigation: recency weighting in training data. Explicit decay on older signals. Real-time session features for current-intent capture.
PATHOLOGY 05
Cold Start Loop
New content can't get engagement because never shown. Never shown because no engagement. Classic chicken-and-egg that permanently disadvantages new creators.
Mitigation: dedicated exploration for new content. Content embeddings to bootstrap signal. Author reputation as proxy. TikTok approach: show every new video to small random sample, observe watch rate.
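An exploration budget, which several of the mitigations above rely on, can be sketched with epsilon-greedy slot filling. This is the simplest bandit-style strategy and is used here as a stand-in: the notes only mention a "multi-armed bandit framing", and the function name, the 20% budget, and the post IDs are all assumptions.

```python
import random


def build_feed(ranked, explore_pool, n_slots, epsilon, rng):
    """Fill each slot with the next ranked post or, with probability
    `epsilon`, a random post from the exploration pool."""
    ranked, pool, feed = list(ranked), list(explore_pool), []
    for _ in range(n_slots):
        if pool and rng.random() < epsilon:
            feed.append(pool.pop(rng.randrange(len(pool))))
        elif ranked:
            feed.append(ranked.pop(0))
    return feed


rng = random.Random(0)
ranked = [f"r{i}" for i in range(10)]   # posts in model-ranked order
pool = [f"new{i}" for i in range(10)]   # e.g. fresh posts with no engagement yet

exploit_only = build_feed(ranked, pool, 5, 0.0, rng)
explore_only = build_feed(ranked, pool, 5, 1.0, rng)
mixed = build_feed(ranked, pool, 8, 0.2, rng)
```

Engagement observed on the explored slots is the signal that lets niche or brand-new content break out of the "never shown, so never scored" loop.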
Evaluation Pipeline
STAGE 1 · Offline Metrics: AUC-ROC, NDCG, Log Loss
← fast, cheap, necessary but not sufficient
STAGE 2 · Shadow Mode: new model runs in parallel, logs only
← catch catastrophic failures before users
STAGE 3 · A/B Test: 1-2 weeks, primary + secondary + guardrail
← ground truth for engagement metrics
STAGE 4 · Long-term Holdout: months, detects slow-burn degradation
← small group never receives change
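The three Stage 1 metrics can be computed with a few lines of standard-library Python. This is a sketch for small inputs (the AUC loop is quadratic), the labels and scores are made up, and the NDCG here uses linear gain, which is one of two common conventions.

```python
import math


def log_loss(labels, probs):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    eps = 1e-12
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(labels, probs)
    ) / len(labels)


def auc(labels, probs):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ndcg(relevances_in_ranked_order, k):
    """DCG of the given order divided by DCG of the ideal order (linear gain)."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances_in_ranked_order, reverse=True))
    return dcg(relevances_in_ranked_order) / ideal if ideal > 0 else 0.0


labels = [1, 0, 1, 0]
probs = [0.9, 0.2, 0.6, 0.7]

model_auc = auc(labels, probs)
perfect_order = ndcg([3, 2, 1], k=3)
reversed_order = ndcg([1, 2, 3], k=3)
```

All three are computed on logged impressions, so they inherit the selection bias described earlier: they say how well a model fits what the old policy showed, not how a new policy would perform.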
A/B Metric Categories
PRIMARY — what you want to improve
DAU/MAU, session length, interactions per session
SECONDARY — quality signal
Survey scores, see-less rate, unfollow rate after session
GUARDRAIL — must not regress
Integrity metrics, ad revenue, newcomer experience
⚠ Guardrail defines the floor. Engagement up + reported content up = NOT shippable.
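The gating role of guardrails can be expressed as a simple ship rule. The metric names and deltas are hypothetical, and the sketch deliberately ignores statistical significance, which a real launch review would require.

```python
PRIMARY = ["sessions_per_user"]
# Direction in which each guardrail metric is allowed to move.
GUARDRAILS = {"reported_content_rate": "down", "ad_revenue": "up"}


def is_shippable(deltas):
    """`deltas` maps metric name -> relative change of treatment vs control."""
    primary_ok = all(deltas[m] > 0 for m in PRIMARY)
    guardrails_ok = all(
        deltas[m] >= 0 if direction == "up" else deltas[m] <= 0
        for m, direction in GUARDRAILS.items()
    )
    return primary_ok and guardrails_ok


# Engagement up but reported content also up: blocked.
blocked = is_shippable(
    {"sessions_per_user": 0.02, "reported_content_rate": 0.05, "ad_revenue": 0.0}
)
# Engagement up and no guardrail regression: allowed.
allowed = is_shippable(
    {"sessions_per_user": 0.02, "reported_content_rate": -0.01, "ad_revenue": 0.0}
)
```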
Q: Your feed ranking model has been live for 6 months and engagement is slowly declining. How do you debug this?
This is a feedback loop pathology question. Walk through systematically: (1) Check concept drift — user interests shifted, model trained on stale history? (2) Check popularity bias — feed homogenised around top creators, long-tail content dying? (3) Check engagement bait — creators gaming weights, signal quality degrading? (4) Check feature drift — has data distribution in features shifted? (5) Compare against long-term holdout if one exists. Mitigation: exploration budget increase, recency-weight the training data harder, re-survey users on perceived quality.
Part II — Where Things Stand Today

The 2021 engineering post described a system designed around a social graph and multi-pass ranking pipeline. What's changed since, and how has the broader field moved? This section traces the evolution at Meta and across the industry — from TikTok's interest-graph challenge to foundation model integration to real-time infrastructure.

LAYER 07

Evolution Since 2021 — State of the Art

⚡ 60-SECOND SUMMARY Three forces reshaped feed ranking post-2021: the TikTok effect (interest graph displaced social graph as primary retrieval source; Meta has reported roughly 2× the share of AI-recommended content from unconnected accounts), foundation models (unified multimodal embeddings replaced task-specific classifiers; LLMs generate offline semantic features), and real-time ML infra (streaming feature stores eliminate train-serve skew). Model architecture evolved: MLP → DCN-v2 (explicit feature interactions) → MoE multitask (learned expert routing) → billion-parameter scaled models. Watch time (TikTok) is a cleaner label than likes because it is less confounded by social obligation.
2018
Meaningful Social Interactions (MSI) — reweighted objective toward person-to-person engagement. First major signal that engagement ≠ value.
2021
Blog published. MTL neural nets, embeddings, offline learning. Social graph primary. Bumping logic for freshness. Multi-pass scoring architecture.
2022
DCN-v2 and feature interaction modeling widely adopted. Explicit cross layers replace implicit MLP-learned interactions. TikTok pressure intensifies — unconnected content % rising.
2023
Meta acknowledges AI-recommended content doubled. Two-tower retrieval becomes primary, not supplementary. Mixture of Experts (MoE) multitask replaces hard-shared bottom. Foundation model content embeddings replace task-specific classifiers.
2024
Scaling ranking models — billion-parameter ranking models. Streaming feature stores mainstream. LLM-generated offline features (topic tags, sentiment, intent classification) enter X_ijt.
2025-26
LLM re-ranking for top-K slots being explored. KernelEvolve (Meta, Apr 2026) — AI agents optimising ranking infrastructure itself. EU DSA regulatory compliance driving explainability constraints.
TikTok vs Meta: Key Axes
Axis | TikTok | Meta (2021→now)
Candidate source | Pure interest graph | Hybrid social + interest
Label quality | Watch time (clean, low social friction) | Likes (socially confounded)
Content format | Homogeneous short video | Heterogeneous (text/photo/video/reels)
Cold start | Random small sample → observe watch rate | Author history + content embeddings
Social graph | Not required | Primary → one signal among many
Where Foundation Models Plug In
  • Content understanding (upstream) — unified multimodal embeddings (text+image+video in one space) replace separate task-specific classifiers. Cross-modal understanding.
  • Semantic interest expansion — LLMs connect "marathon training" to "endurance nutrition" without explicit behavioral signal.
  • Offline feature generation — LLM runs offline on all content → structured features (topic, tone, intent) → cached in X_ijt. No LLM at serving time.
  • LLM re-ranking (emerging) — too expensive for all traffic, being explored for top-K slots only.
Feature Store Architecture
  • Offline store — batch-computed, historical, for training
  • Online store — low-latency serving, point-in-time correct
  • Consistency — same feature logic for both → eliminates train-serve skew
  • Streaming — Kafka → Flink → online store → sub-second freshness
  • Examples: Uber Michelangelo, Airbnb Zipline, Feast (OSS)
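The consistency point can be shown with one feature function shared by both paths. This is a toy sketch, not the API of Feast or any other feature store: the class, the method names, and the `engagement_rate` feature are hypothetical.

```python
def engagement_rate(likes, impressions):
    """Single feature definition shared by training and serving."""
    return likes / impressions if impressions else 0.0


def build_training_row(log_row):
    """Offline path: batch job over historical logs."""
    return {"engagement_rate": engagement_rate(log_row["likes"], log_row["impressions"])}


class OnlineStore:
    """Toy key-value store updated by a stream consumer."""

    def __init__(self):
        self._rows = {}

    def apply_event(self, post_id, liked):
        row = self._rows.setdefault(post_id, {"likes": 0, "impressions": 0})
        row["impressions"] += 1
        row["likes"] += int(liked)

    def get_features(self, post_id):
        row = self._rows.get(post_id, {"likes": 0, "impressions": 0})
        return {"engagement_rate": engagement_rate(row["likes"], row["impressions"])}


# The same four impressions seen by the stream and by the batch job.
online = OnlineStore()
for liked in [True, False, True, False]:
    online.apply_event("p1", liked)

serving_features = online.get_features("p1")
training_features = build_training_row({"likes": 2, "impressions": 4})
```

Because both paths call the same `engagement_rate`, a change to the feature definition cannot silently apply to training but not serving, which is how skew usually creeps in.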
Model Architecture Evolution
  • MLP (2021) — implicit feature interactions via deep layers
  • DCN-v2 (2022) — explicit cross layers for pairwise feature interactions
  • MoE multitask (2023) — learned expert routing. Similar tasks share experts; divergent tasks use different experts.
  • Scaled models (2024): billions of parameters. Ranking quality is reported to improve predictably with scale (scaling-law behaviour). Constraint = inference latency, not training compute.
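Two of the building blocks in this list are compact enough to write out. The cross layer follows the DCN-v2 form x_{l+1} = x_0 ⊙ (W·x_l + b) + x_l; the gate is a softmax-weighted average of expert outputs in the style of multi-gate MoE. The weights, inputs, and two-expert setup are toy assumptions, and nothing here is trained.

```python
import math


def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def cross_layer(x0, xl, w, b):
    """DCN-v2 cross layer: x_{l+1} = x0 * (W xl + b) + xl, elementwise product."""
    proj = [p + bi for p, bi in zip(matvec(w, xl), b)]
    return [x0i * pi + xli for x0i, pi, xli in zip(x0, proj, xl)]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def mixture_of_experts(expert_outputs, gate_logits):
    """Task-specific gate: softmax-weighted average of expert output vectors."""
    gates = softmax(gate_logits)
    dim = len(expert_outputs[0])
    return [sum(g * out[d] for g, out in zip(gates, expert_outputs))
            for d in range(dim)]


identity = [[1.0, 0.0], [0.0, 1.0]]
x0 = [1.0, 2.0]
crossed = cross_layer(x0, x0, identity, [0.0, 0.0])           # [1*1+1, 2*2+2]
blended = mixture_of_experts([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With identity weights the cross layer reduces to x² + x per dimension, which makes the explicit multiplicative interaction visible. In a multitask setup each task would have its own gate logits over a shared set of experts.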
Regulatory Pressure
  • EU Digital Services Act → algorithmic transparency
  • "Why am I seeing this?" UI → must be traceable
  • Chronological feed option (Instagram) → user choice
  • Fairness constraints → no demographic/political bias
  • Architectural implication: pure black-box NN → hybrid interpretable + neural
Q: How would you improve Meta's 2021 system if building it today?
(1) Two-tower as primary retrieval source (not supplementary) — removes social graph as hard gate. (2) Streaming feature store — replace batch+point-lookup with sub-second streaming features. (3) Foundation model content embeddings — unified multimodal, replace task-specific classifiers, better cold start. (4) MoE multitask architecture — replace hard-shared bottom. (5) Watch-time style signals for Reels as cleaner label. (6) LLM-generated offline semantic features enriching X_ijt.
LAYER 08

Design Thinking

⚡ 60-SECOND SUMMARY The interview tests first-principles design thinking, not memorised architecture. Always start with the objective function — never jump to models first. Know the 4 question flavors. Ask clarifying questions on scale, objective, content scope, constraints before designing. L5 differs from L4 by: knowing failure modes, reasoning from first principles, connecting components to business constraints, knowing the evolution. Avoid 7 classic traps. The power moves: mentioning survey-calibrated weights, selection bias in offline metrics, exploration budget for feedback loops, and social graph as signal not gate.
Right Structure for Open-Ended Design
1. Clarify scope — scale, latency SLA, content types, social vs unconnected, ads?
2. Define objective function — V_ijt formula, multi-signal, survey calibration
3. Data — features (3 buckets + cross), labels (implicit + surveys + debiasing)
4. Candidate generation — multi-source fusion, two-tower, integrity filter
5. Ranking model — MTL architecture, multi-pass pipeline
6. Training pipeline — label collection, debiasing, offline + real-time features
7. Evaluation — offline metrics (caveated) → shadow → A/B with 3 metric types
8. Production concerns — latency, embedding serving, feature freshness
9. Monitoring — feedback loop pathologies, long-term holdouts
L4 vs L5 — The Concrete Difference
SOLID FOUNDATION (L4)
  • Multi-pass architecture
  • Multitask learning + why
  • Basic features + embeddings
  • A/B testing as evaluation method
  • Position bias as a problem
DEEPER UNDERSTANDING (L5)
  • Start with objective function, motivate every term
  • Survey-based label calibration (not just clicks)
  • IPS + randomisation for debiasing
  • Two-tower retrieval + ANN indexing
  • Multi-source candidate fusion rationale
  • All 5 feedback loop pathologies + mitigations
  • Guardrail metrics + why they gate not just measure
  • Train-serve skew + feature stores
  • Architecture evolution opinions (MoE, DCN, scaling)
  • Connect to business: ads revenue, regulatory
  • TikTok effect + foundation models evolution
COLD 7 TRAPS MOST CANDIDATES FALL INTO
  • Trap 1: Jumping to architecture before defining objective. Fix: "Before I get to the model, let me define what we're optimising for."
  • Trap 2: Treating candidate generation as trivial ("just fetch social graph posts"). Fix: Multi-source + two-tower discussion.
  • Trap 3: Saying "use a neural network" without specifying architecture. Fix: MTL with shared bottom + task heads, or DCN-v2, or MoE — justify.
  • Trap 4: Only mentioning positive signals (like, comment, share). Fix: Always mention hide, report, unfollow as negative signals with negative weights.
  • Trap 5: Treating offline evaluation as sufficient. Fix: "Offline metrics have a selection bias problem — held-out data reflects the old ranking policy."
  • Trap 6: Ignoring the feedback loop. Fix: Proactively mention pathologies + exploration budget + long-term holdouts.
  • Trap 7: Not connecting to scale. Fix: Ground every architecture choice in the scale constraints from your clarifying questions. Parallelism, embedding serving, latency.
INTERVIEW GOLD POWER MOVE PHRASES
On objective function
"Proxy metric — engagement — diverges from true objective — long-term value — at the extremes. Outrage maximises engagement but destroys trust. This is why Meta uses survey-based labels to calibrate ranking weights, not just behavioural signals."
On candidate generation
"Ranker quality is bounded by candidate pool quality. A perfect ranker over a bad candidate set still produces a bad feed. This is why I'd invest heavily in multi-source candidate fusion with two-tower retrieval rather than relying purely on social graph traversal."
On offline evaluation
"Offline metrics have a fundamental selection bias problem — we can only evaluate on posts the current system chose to show. A model that surfaces completely different content looks worse offline even if it would be better in production."
On feedback loops
"I'd build in an exploration budget from day one — treat it like a multi-armed bandit. Every iteration the model trains on data it generated itself. Without intervention this amplifies popularity bias and creates filter bubbles."
LAYER 09

Core Concepts

⚡ 60-SECOND SUMMARY Minimise brute-force memorisation. Bucket A (cold recall) = facts and formulas with no shortcut. Bucket B (derive on fly) = understand the WHY and reconstruct under pressure. Bucket C (revision anchors) = the connective tissue. The master insight: everything in the system exists because engagement ≠ long-term value. The system is a funnel. If you understand that one tension and can reason from first principles, you can reconstruct any component in the interview.
🔴 BUCKET A — COLD RECALL
Formula: V_ijt = Σ_k [w_ijtk · Y_ijtk(X_ijt)]
Feature buckets: WHO (user) / WHAT (content) / HOW (context) / CROSS (most powerful)
4 passes: Integrity → Pass 0 (1000→500) → Pass 1 (full net) → Pass 2 (diversity)
2 bumping: unread + action
3 storage: actions / objects / summary store
MTL solves: label sparsity + shared structure + inference cost
5 pathologies: popularity bias, filter bubbles, engagement bait, concept drift, cold start loop
3 label biases: position, selection, conformity
Eval pipeline: offline → shadow → A/B (primary/secondary/guardrail) → long-term holdout
TikTok vs Meta: interest vs social graph, watch time vs likes, homogeneous vs heterogeneous
3 post-2021 forces: TikTok effect + foundation models + real-time infra
ANN tooling: FAISS / ScaNN (libraries), HNSW (graph index). Roughly O(log N)
Two weight sets: λ_k (training loss) ≠ w_k (ranking) — don't confuse
🟢 BUCKET B — DERIVE ON FLY
Why MTL? → label sparsity + shared structure + inference cost. Never memorise, derive.
Why multi-pass? → expensive models × 1000 posts. Pass 0 = cost gate. Pass 2 = list-level constraints impossible in Pass 1.
Why embeddings? → sparse IDs can't one-hot at 2B scale. Semantic similarity = geometric closeness.
Why surveys? → implicit labels measure clicks not value. Surveys directly measure perceived value. Used to calibrate w_k.
Why offline metrics insufficient? → selection bias: held-out data reflects old policy, not new model's outputs.
Why integrity before ranking? → prevents model learning from violated content engagement.
Why feature stores? → train-serve skew silently degrades model. Same feature logic for both = consistency.
Why Pass 2 separate? → diversity is list-level, can't apply when scoring posts independently.
Why two-tower for retrieval not ranking? → dot product is ANN-compatible. Cross-attention is not.
Why content embeddings matter for cold start? → semantic signal even before any engagement exists.
🔵 BUCKET C — REVISION ANCHORS
Master insight: Everything exists because engagement ≠ long-term value. One tension, entire system.
The funnel: Millions → 1000 (retrieval) → 500 (Pass 0) → 500 scored (Pass 1) → diversity-adjusted → 20-50 rendered. Each stage: recall at input, precision at output.
Social graph shifting: 2018: gate. 2021: primary. 2024: one signal among many. Design new systems with direction of travel in mind.
Real-time pragmatism: Full online learning = unstable. Pure batch = stale. Meta's solution: batch weights + real-time feature updates via summary store.
Every component has a "why not simpler" answer: If you know why Pass 0 exists over just running Pass 1 on all 1000, you're thinking right.
Guardrails define the floor: Engagement up + integrity down = not shippable. Never frame evaluation as "did engagement go up?"
Feedback loops are silent killers: System works at launch, slowly breaks itself. Proactive design (exploration budget, diversity, holdouts) not reactive firefighting.
⚡ PRE-INTERVIEW EXECUTION CHECKLIST
Ask clarifying questions before designing (scale, objective, content scope, constraints)
Start with objective function — V_ijt formula and motivation for every term
Build the system as a funnel — name all stages and why each exists
For every component: what it does + why it exists + what simpler alternative was rejected + why
Mention negative signals (hide, report) — most candidates forget entirely
Mention survey-based label calibration — almost no candidates know this
Name feedback loop pathologies when discussing monitoring
Caveat offline metrics: selection bias + need for A/B with guardrail metrics
Connect to evolution: TikTok effect, foundation models, streaming feature stores
End with: "What I'd prioritise if building this today vs 2021"
Source: engineering.fb.com/2021/01/26 · Extended 2021→2026 · 9 Layers · ML Systems · Updated May 2026