FB News Feed Ranking — ML System Design Master Notes

LAYER 01

The Problem & Landscape

⚡ 60-SECOND SUMMARY Feed ranking is an information filtering + ordering problem. Given a user and ~1000 candidate posts at time t, find the ordering that maximises long-term value. Hard because: no explicit query (user identity IS the query), multi-objective tension (engagement ≠ value), real-time freshness demands, 2B+ user scale, and dangerous feedback loops. Fundamentally different from search (no query), ads (no direct revenue signal), and recsys (graph-constrained pool, not full catalog).

🎯

Relevance

Does content match what this person cares about?

⚡

Engagement

Will they click/like/comment? Measurable but a noisy proxy.

🌱

Long-term Value

Does this make their life better? Do they come back tomorrow?

THESE DIVERGE AT THE EXTREMES — outrage maximises engagement, destroys value

5 Core Tensions

Scale — 2B users × 1000+ posts = trillions of ops/day
Personalisation depth — coarse categories don't work
Real-time freshness — viral post from 5 min ago must rank now
Multi-objective conflict — clicks ≠ satisfaction ≠ retention
Feedback loop risk — model trains on data it generated

Feed vs The Cousins

System	Intent	Pool	Objective
Search	Explicit query	Query match	Relevance
Ads	Advertiser+ctx	Targeting	CTR × bid
Recsys	Inferred	Full catalog	Engagement
Feed	Identity+graph	Social graph	LT value

COLD

Must Know Cold

~1000 candidates per feed refresh
No explicit query — identity IS the query
Feed has weakest intent signal of 4 systems
Graph-constrained pool

DERIVE

Derive On the Fly

Why chronological fails → frequency ≠ relevance
Why feed ≠ search → no explicit intent
Why multi-objective needed → click ≠ satisfaction

INTERVIEW

Interview Gold

Q: How is feed different from a recommendation system?

Graph-constrained pool + weakest intent signal + value vs engagement objective. Don't just say "it's personalised."

LAYER 02

The Math & Objective Function

⚡ 60-SECOND SUMMARY One formula governs everything: V_ijt = Σ_k w_ijtk · Y_ijtk(X_ijt). Make K predictions (one per action type), weight them by survey-calibrated importance, sum to a scalar. Two key properties: rare actions self-suppress (Y → 0 automatically), and weights can be personalised per user (hence the i, j, t subscripts on w). Features X_ijt live in 3 buckets: user (WHO), content (WHAT), context (HOW), with cross-bucket interaction features being the most predictive.

V_ijt = Σ_k [ w_ijtk · Y_ijtk(X_ijt) ]

i = post · j = user · t = time · k = action type (like, comment, share, hide…)
Y_ijtk = predicted probability of action k ← output of model head k
w_ijtk = weight of action k ← calibrated via SURVEYS, not just engagement data
X_ijt = feature vector ← WHO × WHAT × HOW + cross-features

Building Up: Single → Multi Signal

# Step 1: naive single signal
V = P(like)   ← fails: likes ≠ value

# Step 2: multi signal
Y = [P(like), P(comment), P(share), P(hide), P(report), P(full_watch)]

# Step 3: weighted sum (Y1..Y6 in the order listed in Step 2)
V_ijt = w1·Y1 + w2·Y2 + w3·Y3 + w6·Y6 - w4·Y4 - w5·Y5   ← negatives!

3 Feature Buckets + Cross

WHO (user): behavioral history, real-time session state, social graph strength, device
WHAT (content): post type, engagement velocity, content embeddings, author quality
HOW (context): time of day, device, session depth, geo
CROSS (most powerful): user↔author affinity, topic match, format preference

Cross-features can't be precomputed — depend on both user AND post simultaneously

Dense vs Sparse Features

Dense — continuous numbers (post age, like count) → fed directly to net
Sparse — categorical IDs (user ID, author ID) → must embed first
Embedding tables at 2B users scale → TB size → distributed param servers
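
The dense/sparse split above can be sketched in a few lines. This is a toy illustration only: the table size, the 4-dim embeddings, the modulo bucketing of IDs, and the helper names `embed` and `build_input` are assumptions made for the example, and real embedding rows are learned rather than random.

```python
import random

EMBED_DIM = 4        # tiny for illustration; the notes mention 128 dims
NUM_BUCKETS = 1000   # hypothetical bucket count (IDs are folded by modulo)

random.seed(0)
# One row per bucket. In a real system these rows are learned by backprop.
embedding_table = [
    [random.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
    for _ in range(NUM_BUCKETS)
]


def embed(sparse_id):
    """Map an integer categorical ID to a dense vector via a table lookup."""
    return embedding_table[sparse_id % NUM_BUCKETS]


def build_input(dense_features, sparse_ids):
    """Dense values pass through; each sparse ID is embedded; all are concatenated."""
    vector = list(dense_features)
    for sparse_id in sparse_ids:
        vector.extend(embed(sparse_id))
    return vector


# Post age (hours) and like count are dense; user ID and author ID are sparse.
x = build_input(dense_features=[2.5, 340.0], sparse_ids=[123456789, 42])
```

Folding IDs by modulo means two IDs can share a row (42 and 1042 collide here). That trade of accuracy for memory is a design choice of this sketch, not something the notes prescribe.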

Self-Suppression Property

If user rarely does action k, then Y_ijtk ≈ 0 for that user across all posts. That term contributes near-zero to V automatically — no manual per-user weight zeroing needed. The formula personalises itself implicitly.
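
The self-suppression property can be checked numerically. A minimal sketch follows; the weight values, the predicted probabilities, and the `value_score` helper are made up for illustration and are not Meta's calibrations.

```python
# Hypothetical ranking weights w_k. Negative signals carry negative weights.
WEIGHTS = {"like": 1.0, "comment": 4.0, "share": 6.0, "hide": -8.0, "report": -20.0}


def value_score(predictions, weights=WEIGHTS):
    """Compute V = sum_k w_k * Y_k for one (post, user, time) triple."""
    return sum(weights[action] * p for action, p in predictions.items())


# A user who comments often: the comment term is the largest contribution.
commenter = {"like": 0.20, "comment": 0.10, "share": 0.01, "hide": 0.01, "report": 0.0}
# A user who almost never comments: P(comment) is near zero on every post,
# so the comment term vanishes without any per-user change to the weights.
lurker = {"like": 0.20, "comment": 0.001, "share": 0.01, "hide": 0.01, "report": 0.0}

v_commenter = value_score(commenter)
v_lurker = value_score(lurker)
comment_term_lurker = WEIGHTS["comment"] * lurker["comment"]
```

Both users share the same weight vector; only the predictions differ, which is the sense in which the formula personalises itself.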

Two Separate Weight Sets

λ_k — training loss weights: how much each task shapes the shared layers during backprop.

w_k — ranking weights: how much each prediction contributes to V_ijt at serving time.

⚠ Don't confuse these — common interview mistake
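
To keep the two weight sets apart, it can help to see them in separate functions. The sketch below assumes binary cross-entropy per task and uses invented values for both λ_k and w_k; the function names are hypothetical.

```python
import math

# lambda_k: training loss weights (shape the shared layers during backprop).
LOSS_WEIGHTS = {"like": 1.0, "comment": 1.0, "hide": 2.0}
# w_k: ranking weights (combine predictions into V at serving time).
RANK_WEIGHTS = {"like": 1.0, "comment": 4.0, "hide": -8.0}


def log_loss(label, p):
    """Binary cross-entropy for one example."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


def training_loss(labels, preds):
    """Total loss = sum_k lambda_k * L_k. Used only at training time."""
    return sum(LOSS_WEIGHTS[k] * log_loss(labels[k], preds[k]) for k in preds)


def ranking_value(preds):
    """V = sum_k w_k * Y_k. Used only at serving time."""
    return sum(RANK_WEIGHTS[k] * preds[k] for k in preds)


preds = {"like": 0.3, "comment": 0.05, "hide": 0.02}
labels = {"like": 1, "comment": 0, "hide": 0}

loss = training_loss(labels, preds)
value = ranking_value(preds)
```

In this sketch the hide task has a positive λ (the model should predict hides well) and a negative w (a likely hide should lower the rank). Changing w_k needs no retraining, while changing λ_k does.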

COLD

V_ijt formula — know every subscript
w_k calibrated via surveys not engagement
Negative signals (hide, report) enter with negative weights
Dense → net directly; sparse → embedding lookup first
Two separate weight sets: λ_k vs w_k

INTERVIEW

Q: Walk me through your objective function design.

Start with single signal → motivate failure → build to multi-signal → mention survey calibration of weights. Almost no candidates know this. Then mention the self-suppression property.

LAYER 03

System Architecture

⚡ 60-SECOND SUMMARY Two-layer system: thin PHP/Web frontend (just routes) + smart Feed Aggregator backend (all intelligence). Three storage types: actions store, objects store, summary store (precomputed running aggregates = key for real-time). The aggregator runs 4 sequential passes: integrity filter → Pass 0 (1000→500, cheap, high recall) → Pass 1 (full neural net, independent scoring, main personalisation) → Pass 2 (list-aware diversity + dedup). Parallel predictors make scoring 1000 posts feasible in real time.

DEVICE (iOS/Android/Web · sends request) → FRONTEND (PHP/Web layer · thin router) → BACKEND (Feed Aggregator · all intelligence) → STORAGE (Feed Leaf DBs · actions/objects/summary)

The 4-Pass Pipeline — Input → Process → Output

GATE · Integrity · spam/misinfo/policy → hard remove
→ PASS 0 · Fast Filter · 1000 → ~500 · lightweight model · HIGH RECALL
→ PASS 1 · Main Scoring · 500 posts · full neural net · V_ijt per post · parallel
→ PASS 2 · Contextual · list-aware · diversity rules · dedup
→ OUTPUT · Top N posts rendered on device

3 Storage Types

Actions store — user events (likes, comments, hides). High write throughput, fast point lookups
Objects store — post content + metadata. Bulk reads of 1000 posts
Summary store — precomputed running aggregates. Updated in near-real-time. This is what makes real-time engagement features feasible.
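
A toy version of the summary store idea: update a running aggregate on every write so that ranking-time reads never scan the raw action log. The `SummaryStore` class and its methods are a hypothetical in-memory stand-in, not the real storage API.

```python
from collections import defaultdict


class SummaryStore:
    """Toy in-memory stand-in for a summary store (hypothetical API)."""

    def __init__(self):
        self._counts = defaultdict(lambda: defaultdict(int))

    def record(self, post_id, action):
        """Update the running aggregate at write time."""
        self._counts[post_id][action] += 1

    def get(self, post_id, action):
        """Constant-time read at ranking time; no recount over raw events."""
        return self._counts.get(post_id, {}).get(action, 0)


store = SummaryStore()
events = [("p1", "like"), ("p1", "like"), ("p1", "comment"), ("p2", "like")]
for post_id, action in events:
    store.record(post_id, action)

p1_likes = store.get("p1", "like")
```

The cost moves from read time to write time, which suits a feed where each post is read far more often than it is acted on.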

2 Bumping Mechanisms

Unread bumping — posts ranked in previous session but never seen (user didn't scroll far) → re-eligible. May be higher quality than newer posts.
Action bumping — posts user already saw that now have significant new activity (conversation broke out in comments) → re-surfaced as "comment-bumped."

Why Each Pass Exists

Integrity first → don't learn from engagement on policy-violating content
Pass 0 → cost gate before expensive Pass 1. High recall, not precision
Pass 1 independent → scores per post, no list context yet
Pass 2 separate → diversity is list-level, can't do in isolation
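
The reasons above can be made concrete with a miniature funnel. Everything here is illustrative: the scores are hand-written instead of model outputs, the per-author cap is one assumed example of a Pass 2 diversity rule, and the function names are hypothetical.

```python
def pass0(candidates, keep):
    """Cheap filter: keep the top `keep` posts by a lightweight score."""
    return sorted(candidates, key=lambda p: p["cheap_score"], reverse=True)[:keep]


def pass1(candidates):
    """Independent scoring: each post is ordered by its own full score."""
    return sorted(candidates, key=lambda p: p["full_score"], reverse=True)


def pass2(ranked, max_per_author):
    """List-aware step: cap posts per author, which needs the whole list."""
    seen = {}
    result = []
    for post in ranked:
        count = seen.get(post["author"], 0)
        if count < max_per_author:
            result.append(post)
            seen[post["author"]] = count + 1
    return result


def rank_feed(candidates, keep, max_per_author):
    allowed = [p for p in candidates if not p["violates_policy"]]  # integrity gate
    return pass2(pass1(pass0(allowed, keep)), max_per_author)


posts = [
    {"id": "a", "author": "x", "cheap_score": 0.90, "full_score": 0.80, "violates_policy": False},
    {"id": "b", "author": "x", "cheap_score": 0.80, "full_score": 0.70, "violates_policy": False},
    {"id": "c", "author": "y", "cheap_score": 0.70, "full_score": 0.90, "violates_policy": False},
    {"id": "d", "author": "z", "cheap_score": 0.95, "full_score": 0.99, "violates_policy": True},
    {"id": "e", "author": "z", "cheap_score": 0.10, "full_score": 0.60, "violates_policy": False},
]

feed_ids = [p["id"] for p in rank_feed(posts, keep=3, max_per_author=1)]
```

Post d never reaches a scorer despite having the best scores, e is cut by Pass 0, and b is dropped only in Pass 2 because the cap depends on a having been placed above it.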

COLD

Summary store = precomputed running aggregates (not recount per request)
Unread bumping + action bumping — know both mechanisms
Pass 0: 1000→500, high recall goal, cheap model
Pass 1: independent scoring, full neural net, main personalisation
Pass 2: list-aware, diversity + dedup, cannot run in isolation

INTERVIEW

Q: How do you make scoring fast enough at scale?

Three answers: Pass 0 pruning (halve the pool before expensive models), parallel predictors (posts scored simultaneously across machines), embedding caches (don't re-fetch TB-scale tables per request).

LAYER 04

ML Models Deep Dive

⚡ 60-SECOND SUMMARY Core architecture: Multitask MLP — shared bottom layers learn the (user, post) representation once; task-specific heads specialise per action type. Solves label sparsity, shared structure waste, and inference cost simultaneously. Sparse features (IDs) go through embedding tables (TB scale, distributed param servers) before entering the net. Meta uses offline batch training + real-time feature updates via summary store — pragmatic middle ground between staleness and instability. Cold start handled by content embeddings + author history + engagement velocity.

Multitask Architecture

# Input
dense_features ──────────┐
                         ├──► concat ──► Shared MLP (3-5 layers)
sparse_IDs ──► embed ────┘                      │
                                        ┌───────┴───────┐
                                   task heads (1-2 layers each)
                                   ├── P(like)     head_1
                                   ├── P(comment)  head_2
                                   ├── P(share)    head_3
                                   └── P(hide)     head_4  ← negative

Total Loss = λ1·L1 + λ2·L2 + λ3·L3 + λ4·L4

Why Multitask? (3 problems solved)

Label sparsity — "report" has few positives. Shared layers trained on union of all labels. Report head benefits from abundant like/comment signal.
Shared structure — base user taste signal is shared across all tasks. Training K separate models relearns it K times wastefully.
Inference cost — one shared forward pass, not K separate passes at serving time.

Embedding Types

User embeddings — behavioral taste profile. Similar users → close vectors.
Content embeddings — semantic post meaning (text/image/video). Similar topics → close vectors.
Author/page embeddings — content style + topic distribution.
Two-tower — user + content in same space. Dot product = affinity.

Embedding at Scale

2B users × 128 dims × 4 bytes = ~1TB for user table alone
Can't sit on single machine → distributed parameter servers or sharded in-memory caches
Embedding lookups are often the bottleneck at serving time
Learned end-to-end via backprop through the ranking model
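
The ~1TB figure is plain arithmetic, reproduced below. The 64 GB per-host shard size and the modulo routing in `shard_for` are assumptions added to show why the table has to be split; they are not figures from the notes.

```python
NUM_USERS = 2_000_000_000
DIMS = 128
BYTES_PER_FLOAT32 = 4

table_bytes = NUM_USERS * DIMS * BYTES_PER_FLOAT32
table_tb = table_bytes / 1e12  # decimal terabytes

# Hypothetical: 64 GB of usable memory per host.
SHARD_BYTES = 64 * 10**9
num_shards = -(-table_bytes // SHARD_BYTES)  # ceiling division


def shard_for(user_id, shards=num_shards):
    """Route a lookup to the shard that owns this user's row (modulo sharding)."""
    return user_id % shards
```

Under these assumptions the user table alone needs 16 hosts before counting replication, optimizer state, or the author and content tables.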

Online vs Offline Learning

Offline (batch) — train periodically on historical data. Stable, debuggable. Weakness: model is always stale.
Online — weights updated continuously. Fresh but unstable, hard infra.
Meta's approach: offline training + real-time feature updates. Get freshness from features, not weight updates.

Cold Start — Two Flavors

NEW USER

No behavioral history → no learned user embedding. Fallback: demographic priors, popularity signals. Explicit interest selection during onboarding seeds the profile.

NEW CONTENT

No engagement history → rely on author history + content embeddings + engagement velocity (fast engagement in first minute = strong signal).

INTERVIEW

Q: How do you handle a post with no engagement history?

Cold start: content embeddings provide semantic signal immediately. Author history as proxy. Engagement velocity once it starts — fast growth in first minute is powerful even at low absolute count.

Q: How do you serve embeddings at 2B-user scale?

Distributed parameter servers or sharded in-memory caches. Embedding lookups are often the serving bottleneck. This shows systems depth beyond just ML.

LAYER 05

Candidate Generation & Filtering

⚡ 60-SECOND SUMMARY Candidate generation is as hard as ranking: ranker quality is bounded by candidate pool quality. Facebook's original social graph traversal has a hard ceiling (only sees your network). Modern systems use multi-source fusion: social graph + two-tower ANN retrieval (interest-based) + trending + collaborative filtering. Two-tower trains user/content encoders with contrastive loss, pre-indexes content embeddings in an ANN index (HNSW graph indexes, or libraries such as FAISS and ScaNN), and queries with the user embedding at serve time for sublinear (roughly O(log N)) retrieval from the full corpus. Integrity filtering runs BEFORE ranking so the model never learns from engagement on policy-violating content.

RETRIEVAL FUNNEL

Full corpus

Millions of posts in system

Multi-source candidate gen

~1000 candidates · social graph + two-tower + trending + collab filtering

Pass 0

~500 · lightweight model · high recall

Pass 1 + Pass 2

~500 scored → diversity-adjusted

Rendered

~20-50 posts shown on device

Multi-Source Candidate Fusion

Source 1: Social graph traversal     ~300  → high relevance, network-bounded
Source 2: Two-tower ANN retrieval    ~300  → interest-based, full corpus
Source 3: Trending/viral             ~100  → freshness, popularity signal
Source 4: Collaborative filtering    ~200  → "users like you" signal
Source 5: Re-engagement (bumping)    ~100  → unread + action bumped
                                     ─────
Total:                               ~1000 → Pass 0
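
One simple way to merge the sources is a quota per source plus de-duplication, sketched below. The priority-order merge, the tiny quotas, and the `fuse_candidates` name are assumptions for illustration; the notes only give the approximate per-source counts.

```python
def fuse_candidates(sources, quotas):
    """Take up to quotas[name] new posts from each source, skipping duplicates.

    Sources are visited in dict order, so earlier sources win ties.
    """
    seen = set()
    fused = []
    for name, posts in sources.items():
        taken = 0
        for post_id in posts:
            if taken >= quotas[name]:
                break
            if post_id in seen:
                continue  # another source already supplied this post
            seen.add(post_id)
            fused.append((post_id, name))
            taken += 1
    return fused


sources = {
    "social_graph": ["p1", "p2", "p3"],
    "two_tower": ["p2", "p4", "p5"],
    "trending": ["p4", "p6"],
}
quotas = {"social_graph": 2, "two_tower": 2, "trending": 1}

fused = fuse_candidates(sources, quotas)
fused_ids = [post_id for post_id, _ in fused]
```

Keeping the source name next to each post ID is a deliberate choice: it lets later stages use "which source found this" as a feature and lets you measure each source's contribution to the final feed.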

Two-Tower Model Mechanics

# Architecture
User features → User Tower    → user_emb (d-dim)
Post features → Content Tower → post_emb (d-dim)
score = dot_product(user_emb, post_emb)

# Training
loss = contrastive(
    pos=(user, engaged_post),   ← high dot product
    neg=(user, random_posts)    ← low dot product
)

# Serving
pre-index all post_emb in ANN index
query(user_emb) → top-K in O(log N)   # FAISS / ScaNN / HNSW

Retrieval vs Ranking

Retrieval — recall-focused, fast, approximate. Can afford misses but not slowness.
Ranking — precision-focused, expensive, exact. Adjudicates between retrieved candidates.
ANN = approximate nearest neighbor. Trades small recall loss for massive speedup (O(N) → O(log N)).
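
The retrieval step reduces to "find the posts with the largest dot product against the user embedding". The sketch below does this exactly, by brute force in O(N), which is the baseline an ANN index approximates. The 3-dim embeddings are made-up values and `top_k_exact` is a hypothetical helper; no ANN library is used here.

```python
import heapq


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_exact(user_emb, post_index, k):
    """Brute-force retrieval: score every post, keep the k best."""
    return heapq.nlargest(k, post_index, key=lambda pid: dot(user_emb, post_index[pid]))


# Toy embeddings; both towers are assumed to map into the same 3-dim space.
post_index = {
    "running_tips": [0.9, 0.1, 0.0],
    "marathon_diet": [0.8, 0.3, 0.1],
    "cat_video": [0.0, 0.1, 0.9],
    "tax_advice": [0.1, 0.9, 0.0],
}
user_emb = [1.0, 0.2, 0.0]  # stand-in for the user tower output of a running fan

retrieved = top_k_exact(user_emb, post_index, k=2)
```

Because the score is a plain dot product, post embeddings can be indexed before any user shows up. A ranker that needs user and post features jointly cannot be pre-indexed this way.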

Integrity — Why Before Ranking

Removes spam, misinfo, hate speech, policy violations
Must run before ranking: if violating content reaches users, it gets engagement and enters training data → model learns to rank violations highly
Also: don't waste ranking compute on content that can never be shown
Integrity classifiers have their own recall/precision tradeoffs

Pass 0 Design Constraint

Goal: high recall, not high precision
A missed great post at Pass 0 is invisible — the ranker never gets a chance
A mediocre post that sneaks through Pass 0 gets demoted at Pass 1 — recoverable
Threshold set conservatively. Better to keep too many than to drop too many.

COLD

Ranker quality bounded by candidate pool quality
Two-tower: user tower + content tower → same embedding space → ANN
ANN tooling: FAISS and ScaNN (libraries), HNSW (graph index algorithm); sublinear query time, roughly O(log N), not O(N)
Integrity runs before ranking (not after) — prevents training contamination
Pass 0: recall over precision — false negatives are invisible

INTERVIEW

Q: How do you surface content from outside a user's network?

Two-tower + ANN retrieval. Interest-graph vs social-graph. Social graph becomes one signal among many rather than the gate. Requires pre-indexing content embeddings and querying with user embedding at serve time.

LAYER 06

Training, Evaluation & Feedback Loops

⚡ 60-SECOND SUMMARY Labels come from two sources: implicit behavioral signals (scale, noisy) + surveys (expensive, high-quality). Three systematic biases corrupt implicit labels: position, selection, conformity. Fix with IPS (inverse propensity scoring) and randomisation experiments. Offline metrics (AUC, NDCG, log loss) are necessary but not sufficient because of selection bias: a better model can look worse offline if it surfaces content the old policy never showed. Online pipeline: offline → shadow → A/B with three metric types (primary/secondary/guardrail). Five feedback loop pathologies will silently break the system over time if not proactively mitigated.

Label Sources

IMPLICIT (scale)
strong+: share, save, full watch, long dwell
mod+: like, react, click link
weak+: short dwell, partial watch
weak-: scroll past quickly
strong-: hide post, unfollow, report

EXPLICIT SURVEYS (quality)
"Was this post worth your time?"
"Did this feel meaningful?"
"See more/less like this?"
→ expensive, small-scale, used to calibrate w_k
→ audit whether behavioural gains = perceived gains

3 Systematic Biases + Fixes

Position bias — higher positions get more engagement purely from position. Fix: IPS (reweight by 1/P(shown at position k)) or randomisation experiments.
Selection bias — only observe engagement on shown content. Fix: randomisation experiments (random ordering for small % of users = unbiased labels).
Conformity bias — social proof inflates already-popular content. Fix: use engagement rate not raw count; normalise by exposure.
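
A small worked example of the IPS fix for position bias. Each logged impression is reweighted by the inverse of the probability that a post at that position is examined at all. The propensity values and the log are invented, and `ips_ctr` is a hypothetical helper; in practice propensities would be estimated, for example from the randomisation experiments mentioned above.

```python
def ips_ctr(logs, propensity):
    """Inverse-propensity-weighted click rate.

    logs: list of (position, clicked) pairs for one post.
    propensity: position -> probability that the position is examined.
    """
    weighted_clicks = sum(clicked / propensity[pos] for pos, clicked in logs)
    return weighted_clicks / len(logs)


# Hypothetical examination propensities by feed position.
propensity = {1: 1.0, 2: 0.5, 3: 0.25}

# The same post logged mostly at low positions, where few users even look.
logs = [(1, 1), (2, 1), (2, 0), (3, 0), (3, 1), (3, 0), (3, 0), (3, 0)]

raw_ctr = sum(clicked for _, clicked in logs) / len(logs)
debiased_ctr = ips_ctr(logs, propensity)
```

The raw rate undersells a post that was mostly shown where few users look. The usual cost of IPS is variance: a click at a position with tiny propensity receives a very large weight, so weights are often clipped in practice.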

COLD 5 FEEDBACK LOOP PATHOLOGIES

PATHOLOGY 01

Popularity Bias Amplification

Popular content shown more → more engagement → higher scores → shown even more. Rich get richer. Niche content never discovered.

Fix: exploration budget — multi-armed bandit framing. Deliberately show lower-ranked content to gather signal.

PATHOLOGY 02

Filter Bubbles

User engages with X → model shows more X → feed collapses into ever-narrowing topic space.

Fix: content diversity constraints (Pass 2), diversity penalty in objective, explicit user controls.

PATHOLOGY 03

Engagement Bait Adaptation

Creators reverse-engineer objective weights. "Comment YES or NO." "Watch to the end." Signal degrades as creators optimise against the system.

Fix: rotate/obscure weights. Dedicated engagement-bait classifiers. Survey-based signals are harder to game.

PATHOLOGY 04

Concept Drift

User interests shift over time. Model over-indexed on long history gets stuck in the past. "You liked cooking 2 years ago → still showing cooking."

Fix: recency weighting in training data. Explicit decay on older signals. Real-time session features for current-intent capture.

PATHOLOGY 05

Cold Start Loop

New content can't get engagement because never shown. Never shown because no engagement. Classic chicken-and-egg that permanently disadvantages new creators.

Fix: dedicated exploration for new content. Content embeddings to bootstrap signal. Author reputation as proxy. TikTok approach: show every new video to a small random sample, observe watch rate.
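
The exploration budget that mitigates the popularity and cold start loops can be sketched with epsilon-greedy slotting, the simplest bandit-style policy. This is one assumed way to spend the budget, not a description of any production system; `build_feed` and its parameters are hypothetical.

```python
import random


def build_feed(ranked, explore_pool, n_slots, epsilon, rng):
    """Fill each slot from the ranker, except that with probability epsilon
    the slot goes to a random post from the exploration pool."""
    ranked = list(ranked)
    explore_pool = list(explore_pool)
    feed = []
    for _ in range(n_slots):
        if explore_pool and rng.random() < epsilon:
            feed.append(explore_pool.pop(rng.randrange(len(explore_pool))))
        elif ranked:
            feed.append(ranked.pop(0))
    return feed


ranked = ["a", "b", "c"]     # ranker output, best first
explore_pool = ["x", "y"]    # e.g. new posts with no engagement yet

pure_exploit = build_feed(ranked, explore_pool, n_slots=3, epsilon=0.0, rng=random.Random(0))
pure_explore = build_feed(ranked, explore_pool, n_slots=2, epsilon=1.0, rng=random.Random(0))
```

Epsilon is the budget: the fraction of slots spent buying unbiased signal at some short-term engagement cost. Logging which slots were exploratory keeps that data usable as randomised labels.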

Evaluation Pipeline

STAGE 1 · Offline Metrics · AUC-ROC, NDCG, Log Loss ← fast, cheap, necessary but not sufficient
↓
STAGE 2 · Shadow Mode · new model runs parallel, logs only ← catch catastrophic failures before users
↓
STAGE 3 · A/B Test · 1-2 weeks, primary+secondary+guardrail ← ground truth for engagement metrics
↓
STAGE 4 · Long-term Holdout · months, detects slow-burn degradation ← small group never receives change

A/B Metric Categories

PRIMARY — what you want to improve

DAU/MAU, session length, interactions per session

SECONDARY — quality signal

Survey scores, see-less rate, unfollow rate after session

GUARDRAIL — must not regress

Integrity metrics, ad revenue, newcomer experience

⚠ Guardrail defines the floor. Engagement up + reported content up = NOT shippable.
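
The "guardrails gate, they don't just measure" rule can be written as a ship decision. This sketch assumes every delta has already been oriented so that positive means better, and it ignores statistical significance, which a real A/B readout would need. The metric names and the `ship_decision` helper are hypothetical.

```python
def ship_decision(deltas, primary, guardrails, tolerance=0.0):
    """Ship only if some primary metric improves and no guardrail regresses.

    deltas: metric -> relative change, treatment vs control, where a positive
    value always means "better" (so more reported content is negative).
    """
    guardrails_ok = all(deltas[m] >= -tolerance for m in guardrails)
    primary_up = any(deltas[m] > 0 for m in primary)
    return guardrails_ok and primary_up


primary = ["session_length"]
guardrails = ["integrity", "ad_revenue"]

# Engagement up, but integrity regressed: not shippable.
regressed = {"session_length": 0.03, "integrity": -0.02, "ad_revenue": 0.0}
# Engagement up with guardrails flat or better: shippable.
clean = {"session_length": 0.03, "integrity": 0.0, "ad_revenue": 0.01}

ship_regressed = ship_decision(regressed, primary, guardrails)
ship_clean = ship_decision(clean, primary, guardrails)
```

The guardrail check is an `all`, not a weighted sum: no size of primary gain can buy back a guardrail regression.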

Q: Your feed ranking model has been live for 6 months and engagement is slowly declining. How do you debug this?

This is a feedback loop pathology question. Walk through systematically: (1) Check concept drift — user interests shifted, model trained on stale history? (2) Check popularity bias — feed homogenised around top creators, long-tail content dying? (3) Check engagement bait — creators gaming weights, signal quality degrading? (4) Check feature drift — has data distribution in features shifted? (5) Compare against long-term holdout if one exists. Mitigation: exploration budget increase, recency-weight the training data harder, re-survey users on perceived quality.

LAYER 07

Evolution Since 2021 — State of the Art

⚡ 60-SECOND SUMMARY Three forces reshaped feed ranking post-2021: TikTok effect (interest graph displaced social graph as primary retrieval source; Meta reports roughly doubling the share of AI-recommended content from unconnected accounts), foundation models (unified multimodal embeddings replaced task-specific classifiers; LLMs generate offline semantic features), and real-time ML infra (streaming feature stores eliminate train-serve skew). Model architecture evolved: MLP → DCN-v2 (explicit feature interactions) → MoE multitask (learned expert routing) → billion-parameter scaled models. Watch time (TikTok) is a cleaner label than likes because it is less confounded by social obligation.

2018

Meaningful Social Interactions (MSI) — reweighted objective toward person-to-person engagement. First major signal that engagement ≠ value.

2021

Blog published. MTL neural nets, embeddings, offline learning. Social graph primary. Bumping logic for freshness. Multi-pass scoring architecture.

2022

DCN-v2 and feature interaction modeling widely adopted. Explicit cross layers replace implicit MLP-learned interactions. TikTok pressure intensifies — unconnected content % rising.

2023

Meta acknowledges AI-recommended content doubled. Two-tower retrieval becomes primary, not supplementary. Mixture of Experts (MoE) multitask replaces hard-shared bottom. Foundation model content embeddings replace task-specific classifiers.

2024

Scaling ranking models — billion-parameter ranking models. Streaming feature stores mainstream. LLM-generated offline features (topic tags, sentiment, intent classification) enter X_ijt.

2025-26

LLM re-ranking for top-K slots being explored. KernelEvolve (Meta, Apr 2026) — AI agents optimising ranking infrastructure itself. EU DSA regulatory compliance driving explainability constraints.

TikTok vs Meta — 5 Key Axes

Axis	TikTok	Meta (2021→now)
Candidate source	Pure interest graph	Hybrid social+interest
Label quality	Watch time (clean, low social friction)	Likes (socially confounded)
Content format	Homogeneous short video	Heterogeneous (text/photo/video/reels)
Cold start	Random small sample → observe watch rate	Author history + content embeddings
Social graph	Not required	Primary → one signal among many

Where Foundation Models Plug In

Content understanding (upstream) — unified multimodal embeddings (text+image+video in one space) replace separate task-specific classifiers. Cross-modal understanding.
Semantic interest expansion — LLMs connect "marathon training" to "endurance nutrition" without explicit behavioral signal.
Offline feature generation — LLM runs offline on all content → structured features (topic, tone, intent) → cached in X_ijt. No LLM at serving time.
LLM re-ranking (emerging) — too expensive for all traffic, being explored for top-K slots only.

Feature Store Architecture

Offline store — batch-computed, historical, point-in-time correct joins for training
Online store — low-latency serving of the latest feature values
Consistency — same feature logic for both → eliminates train-serve skew
Streaming — Kafka → Flink → online store → sub-second freshness
Examples: Uber Michelangelo, Airbnb Zipline, Feast (OSS)
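
Point-in-time correctness means a training example sees each feature as it was when the impression happened, not as it is now. A minimal sketch with an append-only log and a binary search follows; `FeatureLog` is a hypothetical toy, not the API of any feature store named above.

```python
import bisect


class FeatureLog:
    """Append-only (timestamp, value) history for one feature of one entity."""

    def __init__(self):
        self._timestamps = []
        self._values = []

    def write(self, timestamp, value):
        # Assumes writes arrive in non-decreasing timestamp order.
        self._timestamps.append(timestamp)
        self._values.append(value)

    def as_of(self, timestamp):
        """Latest value written at or before `timestamp`, or None if none exists."""
        i = bisect.bisect_right(self._timestamps, timestamp)
        return self._values[i - 1] if i else None


like_count = FeatureLog()
like_count.write(100, 3)
like_count.write(200, 10)
like_count.write(300, 25)

# An impression logged at t=250 must train on 10 likes, not the later 25.
training_value = like_count.as_of(250)
serving_value = like_count.as_of(300)
```

Joining training rows against today's value (25) instead of `as_of(250)` leaks future information into training, which is one common source of train-serve skew.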

Model Architecture Evolution

MLP (2021) — implicit feature interactions via deep layers
DCN-v2 (2022) — explicit cross layers for pairwise feature interactions
MoE multitask (2023) — learned expert routing. Similar tasks share experts; divergent tasks use different experts.
Scaled models (2024) — billions of parameters. Ranking models are reported to follow scaling laws. Constraint = inference latency, not training compute.
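
To make "explicit cross layers" concrete, here is a single DCN-v2 style cross layer in plain Python, using the form x_next = x0 ⊙ (W·x_l + b) + x_l. The weights and inputs are arbitrary illustrative numbers (they are learned in practice) and `cross_layer` is a hypothetical helper.

```python
def cross_layer(x0, xl, weight, bias):
    """One cross layer: x_next = x0 * (W @ xl + b) + xl, with elementwise *."""
    projected = [
        sum(w * v for w, v in zip(row, xl)) + b
        for row, b in zip(weight, bias)
    ]
    return [a * p + c for a, p, c in zip(x0, projected, xl)]


x0 = [1.0, 2.0]          # input features (after embedding and concat)
weight = [[0.5, 0.5],
          [1.0, 0.0]]    # illustrative values only
bias = [0.0, 1.0]

x1 = cross_layer(x0, x0, weight, bias)  # first layer uses xl = x0
x2 = cross_layer(x0, x1, weight, bias)  # stacking raises the interaction order
```

Each layer multiplies the original input by a linear mix of the current features, so products of features appear by construction instead of having to be discovered by a deep MLP. The `+ xl` residual term keeps the lower-order terms.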

Regulatory Pressure

EU Digital Services Act → algorithmic transparency
"Why am I seeing this?" UI → must be traceable
Chronological feed option (Instagram) → user choice
Fairness constraints → no demographic/political bias
Architectural implication: pure black-box NN → hybrid interpretable + neural

Q: How would you improve Meta's 2021 system if building it today?

(1) Two-tower as primary retrieval source (not supplementary) — removes social graph as hard gate. (2) Streaming feature store — replace batch+point-lookup with sub-second streaming features. (3) Foundation model content embeddings — unified multimodal, replace task-specific classifiers, better cold start. (4) MoE multitask architecture — replace hard-shared bottom. (5) Watch-time style signals for Reels as cleaner label. (6) LLM-generated offline semantic features enriching X_ijt.

LAYER 08

Design Thinking

⚡ 60-SECOND SUMMARY The interview tests first-principles design thinking, not memorised architecture. Always start with the objective function — never jump to models first. Know the 4 question flavors. Ask clarifying questions on scale, objective, content scope, constraints before designing. L5 differs from L4 by: knowing failure modes, reasoning from first principles, connecting components to business constraints, knowing the evolution. Avoid 7 classic traps. The power moves: mentioning survey-calibrated weights, selection bias in offline metrics, exploration budget for feedback loops, and social graph as signal not gate.

Right Structure for Open-Ended Design

1. Clarify scope — scale, latency SLA, content types, social vs unconnected, ads?

2. Define objective function — V_ijt formula, multi-signal, survey calibration

3. Data — features (3 buckets + cross), labels (implicit + surveys + debiasing)

4. Candidate generation — multi-source fusion, two-tower, integrity filter

5. Ranking model — MTL architecture, multi-pass pipeline

6. Training pipeline — label collection, debiasing, offline + real-time features

7. Evaluation — offline metrics (caveated) → shadow → A/B with 3 metric types

8. Production concerns — latency, embedding serving, feature freshness

9. Monitoring — feedback loop pathologies, long-term holdouts

L4 vs L5 — The Concrete Difference

SOLID FOUNDATION

Multi-pass architecture
Multitask learning + why
Basic features + embeddings
A/B testing as evaluation method
Position bias as a problem

DEEPER UNDERSTANDING

Start with objective function, motivate every term
Survey-based label calibration (not just clicks)
IPS + randomisation for debiasing
Two-tower retrieval + ANN indexing
Multi-source candidate fusion rationale
All 5 feedback loop pathologies + mitigations
Guardrail metrics + why they gate not just measure
Train-serve skew + feature stores
Architecture evolution opinions (MoE, DCN, scaling)
Connect to business: ads revenue, regulatory
TikTok effect + foundation models evolution

COLD 7 TRAPS MOST CANDIDATES FALL INTO

Trap 1: Jumping to architecture before defining objective. Fix: "Before I get to the model, let me define what we're optimising for."
Trap 2: Treating candidate generation as trivial ("just fetch social graph posts"). Fix: Multi-source + two-tower discussion.
Trap 3: Saying "use a neural network" without specifying architecture. Fix: MTL with shared bottom + task heads, or DCN-v2, or MoE — justify.
Trap 4: Only mentioning positive signals (like, comment, share). Fix: Always mention hide, report, unfollow as negative signals with negative weights.

Trap 5: Treating offline evaluation as sufficient. Fix: "Offline metrics have a selection bias problem — held-out data reflects the old ranking policy."
Trap 6: Ignoring the feedback loop. Fix: Proactively mention pathologies + exploration budget + long-term holdouts.
Trap 7: Not connecting to scale. Fix: Ground every architecture choice in the scale constraints from your clarifying questions. Parallelism, embedding serving, latency.

INTERVIEW GOLD POWER MOVE PHRASES

On objective function

"Proxy metric — engagement — diverges from true objective — long-term value — at the extremes. Outrage maximises engagement but destroys trust. This is why Meta uses survey-based labels to calibrate ranking weights, not just behavioural signals."

On candidate generation

"Ranker quality is bounded by candidate pool quality. A perfect ranker over a bad candidate set still produces a bad feed. This is why I'd invest heavily in multi-source candidate fusion with two-tower retrieval rather than relying purely on social graph traversal."

On offline evaluation

"Offline metrics have a fundamental selection bias problem — we can only evaluate on posts the current system chose to show. A model that surfaces completely different content looks worse offline even if it would be better in production."

On feedback loops

"I'd build in an exploration budget from day one — treat it like a multi-armed bandit. Every iteration the model trains on data it generated itself. Without intervention this amplifies popularity bias and creates filter bubbles."

LAYER 09

Core Concepts

⚡ 60-SECOND SUMMARY Minimise brute-force memorisation. Bucket A (cold recall) = facts and formulas with no shortcut. Bucket B (derive on fly) = understand the WHY and reconstruct under pressure. Bucket C (revision anchors) = the connective tissue. The master insight: everything in the system exists because engagement ≠ long-term value. The system is a funnel. If you understand that one tension and can reason from first principles, you can reconstruct any component in the interview.

🔴 BUCKET A — COLD RECALL

Formula: V_ijt = Σ_k [w_ijtk · Y_ijtk(X_ijt)]

Feature buckets: WHO (user) / WHAT (content) / HOW (context) / CROSS (most powerful)

4 passes: Integrity → Pass 0 (1000→500) → Pass 1 (full net) → Pass 2 (diversity)

2 bumping: unread + action

3 storage: actions / objects / summary store

MTL solves: label sparsity + shared structure + inference cost

5 pathologies: popularity bias, filter bubbles, engagement bait, concept drift, cold start loop

3 label biases: position, selection, conformity

Eval pipeline: offline → shadow → A/B (primary/secondary/guardrail) → long-term holdout

TikTok vs Meta: interest vs social graph, watch time vs likes, homogeneous vs heterogeneous

3 post-2021 forces: TikTok effect + foundation models + real-time infra

ANN tooling: FAISS / ScaNN (libraries), HNSW (graph index); roughly O(log N)

Two weight sets: λ_k (training loss) ≠ w_k (ranking) — don't confuse

🟢 BUCKET B — DERIVE ON FLY

Why MTL? → label sparsity + shared structure + inference cost. Never memorise, derive.

Why multi-pass? → expensive models × 1000 posts. Pass 0 = cost gate. Pass 2 = list-level constraints impossible in Pass 1.

Why embeddings? → sparse IDs can't one-hot at 2B scale. Semantic similarity = geometric closeness.

Why surveys? → implicit labels measure clicks not value. Surveys directly measure perceived value. Used to calibrate w_k.

Why offline metrics insufficient? → selection bias: held-out data reflects old policy, not new model's outputs.

Why integrity before ranking? → prevents the model learning from engagement on policy-violating content.

Why feature stores? → train-serve skew silently degrades model. Same feature logic for both = consistency.

Why Pass 2 separate? → diversity is list-level, can't apply when scoring posts independently.

Why two-tower for retrieval not ranking? → dot product is ANN-compatible. Cross-attention is not.

Why content embeddings matter for cold start? → semantic signal even before any engagement exists.

🔵 BUCKET C — REVISION ANCHORS

Master insight: Everything exists because engagement ≠ long-term value. One tension, entire system.

The funnel: Millions → 1000 (retrieval) → 500 (Pass 0) → 500 scored (Pass 1) → diversity-adjusted → 20-50 rendered. Each stage: recall at input, precision at output.

Social graph shifting: 2018: gate. 2021: primary. 2024: one signal among many. Design new systems with direction of travel in mind.

Real-time pragmatism: Full online learning = unstable. Pure batch = stale. Meta's solution: batch weights + real-time feature updates via summary store.

Every component has a "why not simpler" answer: If you know why Pass 0 exists over just running Pass 1 on all 1000, you're thinking right.

Guardrails define the floor: Engagement up + integrity down = not shippable. Never frame evaluation as "did engagement go up?"

Feedback loops are silent killers: System works at launch, slowly breaks itself. Proactive design (exploration budget, diversity, holdouts) not reactive firefighting.

⚡ PRE-INTERVIEW EXECUTION CHECKLIST

Ask clarifying questions before designing (scale, objective, content scope, constraints)

Start with objective function — V_ijt formula and motivation for every term

Build the system as a funnel — name all stages and why each exists

For every component: what it does + why it exists + what simpler alternative was rejected + why

Mention negative signals (hide, report) — most candidates forget entirely

Mention survey-based label calibration — almost no candidates know this

Name feedback loop pathologies when discussing monitoring

Caveat offline metrics: selection bias + need for A/B with guardrail metrics

Connect to evolution: TikTok effect, foundation models, streaming feature stores

End with: "What I'd prioritise if building this today vs 2021"
