ML Systems · Recommendation

Deep Neural Networks for YouTube Recommendations

Notes on the 2016 Google RecSys paper by Covington, Adams, and Sargin — a landmark that shaped how large-scale recommendation systems are built. The first half traces the architecture as published; the second half tracks what the field has learned since.

Source Covington, Adams, Sargin — RecSys 2016
Published May 2026
Covers Architecture · Systems · Evolution
Contents
Part II — Where Things Stand Today
Phase 01

The Big Picture

Why This Paper Matters & The Three Core Challenges

YouTube's problem is information retrieval at extreme scale with personalization. At the time of publication: ~800M+ videos in corpus, 1B+ users each with different taste and context, strict serving latency of tens of milliseconds, and a corpus growing by hours of content every second.

The naive approach — score every video for every user — is computationally impossible. The entire system is designed around one central insight:

You can't be precise at scale, so you decompose into stages: be fast and broad first, then slow and precise on a small set.

The three challenges the paper identifies each force a specific architectural decision:

ChallengeCore ProblemArchitectural Response
ScaleMillions of videos, billions of users, real-time servingTwo-stage funnel, ANN at retrieval, separate models per stage
FreshnessNew content every second; historically-trained models bias toward old contentExample age feature, continuous training pipelines, exploration
NoiseImplicit signals only; no ground truth satisfaction; metadata quality lowWatch time label over CTR, per-user loss weighting, robust surrogates
Mini-Note 1.1
🔴 Memorize Cold
  • Two-stage funnel: Retrieve (millions→hundreds) → Rank (hundreds→dozens)
  • Scale challenge → forces ANN and stage decomposition
  • Freshness challenge → forces example age feature
  • Noise challenge → forces watch time label + per-user weighting
🟢 Derive From First Principles
  • If scale is extreme → can't score all items → need retrieval
  • If retrieval is broad → needs second pass for precision → two stages
  • Volume of implicit feedback dominates explicit at tail → use implicit

The Two-Stage Architecture: Data Flow & Topology

Think of the system as a progressive filter with increasing cost per item:

Millions of Videos
CHEAP · fast · broad
Candidate Generation
~500 candidates · high recall
Ranking
EXPENSIVE · slow · precise
~Dozens Shown to User

Key Connection Points Between Stages

The Feedback Loop Problem: Training only on recommendation-driven watches creates a closed loop — the model reinforces its own biases, new content is starved of signal, filter bubbles form. The paper's fix: train on ALL YouTube watches regardless of source.

Offline vs. online metric divergence is a fundamental property of recommendation systems. Offline metrics (precision, recall, MAP) are for iteration speed. A/B testing is the ground truth. They don't always agree — because offline evaluates on data generated by the old model, while the new model changes what data gets generated.

Phase 02

Candidate Generation

Recommendation as Extreme Multiclass Classification

The paper reframes recommendation as a classification problem. Given user U in context C, what is the probability they watch video i next?

softmax P(wt = i | U, C) = exp(vi · u) / Σj∈V exp(vj · u)

Where u ∈ R^N is the user embedding produced by the DNN, and v_i ∈ R^N are video embeddings. The network learns user embeddings such that dot product with relevant video embeddings is maximized.

Why Classification and Not Regression or Pairwise Ranking?

Alternatives
  • Regression: what label? sparse explicit ratings don't scale
  • Pairwise ranking: combinatorially expensive at millions of videos
  • Hierarchical softmax: tried and rejected — arbitrary tree splits hurt accuracy
Why Classification Wins
  • Natural label: the video the user actually watched next
  • Produces embeddings as a byproduct — reusable for ANN serving
  • Scales with negative sampling (100x+ speedup)
  • Non-linear generalization of matrix factorization

Sampled Softmax

Standard softmax over millions of classes is infeasible. The solution: sample ~thousands of negatives from the background distribution, compute cross-entropy loss only over this small set, and apply importance weighting to correct for sampling bias.

For each training example (user watched video i):
  positive:  video i
  negatives: ~thousands sampled randomly from corpus
  
Loss = cross-entropy over {positive ∪ sampled_negatives}
Correction: importance weighting for non-uniform sampling

Speedup: 100x+ over full softmax
Mini-Note 2.1
🔴 Memorize Cold
  • Sampled softmax: ~thousands negatives, importance weighted, 100x speedup
  • User embedding = DNN output layer (256-dim)
  • Video embedding = softmax output layer weights (same 256-dim space)
  • Why not hierarchical softmax: arbitrary tree splits degrade accuracy
🟢 Derive
  • Why classification → produces embeddings for ANN serving as a free byproduct
  • Why implicit feedback → volume wins at tail; explicit feedback too sparse
  • Why not pairwise ranking → combinatorially expensive at millions

Model Architecture: Embeddings, Averaging & the Tower Network

The core challenge: take heterogeneous, variable-length, sparse inputs and produce a fixed-size dense vector (the user embedding). Here is how each input type is handled:

Step 1: Sparse ID → dense embedding via lookup table
  Video ID (integer) → E ∈ R^(|V| × 256) → dense vector ∈ R^256
  1M videos × 256 dims = 256M params (bulk of model)

Step 2: Variable-length sequence → fixed vector via averaging
  [v1_emb, v2_emb, ..., v50_emb] → average → single 256-dim vector
  Tried: sum, max, average → averaging won

Step 3: All features concatenated → wide first layer
  watch_avg [256] + search_avg [256] + geo [32] + device [32] + scalars
  → wide ~600-800 dim concatenated vector

Step 4: Tower of ReLU layers
  Wide input → 2048 → 1024 → 512 → 256 (user embedding u)
Key non-obvious detail: The output layer weights of the softmax classifier ARE the video embeddings v_i. They live in the same 256-dim space as the user embedding — which is exactly why dot product similarity works for ANN retrieval at serving time.
Architecture ChoiceWhyTrade-off
Averaging over RNNFast, parallelizable, no sequential dependencyLoses sequence order information
Wide → narrow towerInformation bottleneck; compress to meaningful 256-dimDimensionality limits representation capacity
Shared embedding tables3x+ parameter reduction; rare entities generalize betterSame base; layers above learn role-specific transforms
Separate feature inputs despite shared embeddingsSame video plays different semantic roles in different featuresMore complex input layer

Depth experiments from the paper confirmed that each additional layer adds non-linear interaction modeling. Best configuration: 1024 → 512 → 256 ReLU layers.

Heterogeneous Signals: What Goes In and Why

Key advantage of DNN over matrix factorization: MF can only use the user-item interaction matrix. DNN can consume anything representable as a vector.
SignalWhat It CapturesRepresentationKey Design Choice
Watch historyLong-term tasteAvg of 50 video embeddingsALL watches (not rec-driven); bag of 50
Search historyShort-term intentAvg of unigram+bigram embeddingsUnordered bag; bigrams capture phrases
GeographyRegional priorsLearned embedding ~32-dimHigh cardinality → embedding not scalar
DeviceContext / consumption modeLearned embedding ~32-dimMobile vs desktop preferences differ
Age / GenderDemographic priorsNormalized scalar [0,1]Binary/continuous → direct scalar input
Example ageFreshness correctionScalar = t_max − t_watchSet to 0 at serving (see 2.4)

Demographics serve as priors for cold start. A new user with no watch history gets a zero embedding for watch history — the model gracefully falls back to demographic-based population priors. As history accumulates, the model automatically downweights demographics.

Notably absent at candidate generation: video content features (title, description, audio), social signals, explicit feedback. Content features appear in modern systems but not in this 2016 architecture.

The Example Age Feature: Correcting Historical Bias

Precise definition: Example age = t_max − t_watch, where t_max is the last day of the training window and t_watch is when the watch event occurred. It is the age of the training example, not the age of the video.

The Problem: Sampling Bias Toward the Past

Within a 6-week training window, a video uploaded 5 weeks ago has 5 weeks of watch events. A video uploaded 3 days ago has 3 days. The model naively learns "old popular videos get watched more" — a sampling artifact, not reality.

The Fix: Teaching the Model About Time

During training:
  example_age = t_max - t_watch   (e.g., 0 to 42 days for 6-week window)
  Model learns: small example_age ↔ fresh content popularity spike
                large example_age ↔ established stable popularity

At serving:
  example_age = 0  (or slightly negative)
  Tells the model: "predict for right now"
  Model applies the fresh-content part of the learned popularity curve
What It IS
  • Age of the watch event (training example)
  • Correction for sampling bias
  • Teaches time-dependent popularity
  • Set to 0 at serving = "right now"
What It is NOT
  • Age of the video (upload date)
  • A freshness filter or score
  • Video-level recency signal
  • User recency signal
Mini-Note 2.4
🔴 Memorize Cold
  • example_age = t_max − t_watch (NOT video age)
  • Set to 0 (or slightly negative) at serving time
  • Fixes sampling bias, not video-level staleness
🟢 Derive
  • Why bias exists → more training examples for older videos due to elapsed time
  • Why setting 0 at serving works → applies fresh-content popularity curve learned during training

Label & Context Selection: The Surrogate Problem

The paper's strong claim: "The choice of this surrogate learning problem has an outsized importance on performance in A/B testing but is very difficult to measure with offline experiments."

Getting the training setup right matters more than model architecture.

The Rollback Fix

WRONG — Random Holdout:
  History: v1 v2 v3 [v4] v5 v6   (v4 held out)
  Input:   v1 v2 v3     v5 v6    ← sees future events!
  Problem: future leakage + ignores asymmetric consumption

CORRECT — Rollback (future watch prediction):
  Choose random cutpoint t in history
  Input:  everything BEFORE t       (v1 v2 v3)
  Label:  the NEXT watch AFTER t    (v4)
  Excluded: v5, v6                  (strict future — never seen)

The Search Query Contamination Problem

Including the most recent search query as a feature when training a homepage model causes the model to learn: "after searching X, recommend X results." It reproduces the search results page as homepage recommendations — disastrous in A/B tests.

Fix: Represent search queries as unordered bags of tokens. Discard sequence and recency. The model can no longer connect a specific search to specific results.

Four Data Pipeline Fixes

FixProblem Solved
Train on ALL watches (not rec-driven)Closed feedback loop; new content starvation
Per-user example capping (fixed N)Power user dominance; equal loss weighting
Rollback label (future watch)Future leakage; asymmetric consumption patterns
Unordered bag of tokens for searchSearch query contamination on homepage model
Meta-lesson: The biggest wins in this system came from fixing how data is used, not from making the model more complex. Data fixes often beat model fixes.

Serving: Approximate Nearest Neighbor, Not Softmax

The softmax denominator is constant across all videos — it doesn't affect ranking. So top-N retrieval reduces to finding video embeddings with the highest dot product with the user embedding. This is Maximum Inner Product Search (MIPS) — a nearest neighbor problem.

Top-N retrieval = argmaxi (vi · u) — a MIPS problem, not full softmax
ANN AlgorithmApproachQuery TimeTrade-off
LSH (2016 approach)Locality sensitive hashing to bucketsO(1) amortizedRecall degrades in high dimensions
HNSWMulti-layer proximity graph navigationO(log N)High memory; excellent recall/speed
IVFPQCoarse clusters + product quantizationO(1) approxMemory efficient; good for billion scale
ScaNN (Google prod)Anisotropic quantization for MIPSO(log N)2x faster than HNSW; optimized for dot product

The A/B results were not sensitive to the choice of ANN algorithm — because the ranking stage downstream corrects retrieval errors. The two-stage design provides error tolerance: retrieval doesn't need to be perfect.

Serving Pipeline

User request
  → User embedding DNN forward pass (~5ms)
  → ANN lookup against pre-built index (~5ms)
  → Top-500 candidate video IDs
  → [Ranking stage]
Train/serve asymmetry: Training uses sampled softmax; serving uses ANN dot product. This works because sampled softmax IS essentially dot product optimization — but it's an implicit assumption. Modern two-tower models address this explicitly with contrastive loss.
Phase 03

Ranking

Why a Separate Ranker?

DimensionCandidate GenerationRanking
Items scoredMillions~Hundreds
Compute per item~0.1ms max~1ms+ affordable
Feature availabilityUser-side only (don't know specific item yet)Full (user × specific video) cross features
ObjectiveBroad relevance recallPrecise engagement prediction
OutputUser embedding for ANNScalar score per video
LossSampled softmax (classification)Weighted logistic regression

Why Not CTR? The Clickbait Problem

CTR measures the quality of the thumbnail and title. Watch time measures the quality of the video itself. You can trick a click with a deceptive thumbnail — you cannot trick 8 minutes of attention.

The four jobs of the ranker beyond simple scoring: precise scoring with rich features, context specialization (homepage vs. watch-next vs. mobile), source blending (normalize candidates from CF, trending, search), and churn introduction (demote repeatedly ignored impressions).

Feature Taxonomy

The ranking model uses hundreds of features organized along three independent taxonomic dimensions:

DimensionTypesKey Point
Value typeCategorical vs. ContinuousCategorical → embeddings; Continuous → quantile normalization
CardinalityUnivalent (single value) vs. Multivalent (set)Multivalent → average embeddings before feeding to network
Compute timingQuery (user/context) vs. Impression (video-specific)Query computed once per request; Impression computed per candidate

Feature Importance Hierarchy

The paper is explicit: the most important signals are those describing a user's previous interaction with the item and similar items.

Query vs. Impression compute pattern: Query features are computed ONCE per request and amortized across all ~500 candidates. Impression features are computed per candidate — their cost scales linearly. This guides where to invest in feature computation latency budget.

Shared Embeddings & Quantile Normalization

Shared Embedding Spaces

Multiple features reference the same entity (e.g., video ID of the impression, last watched video ID, seed video ID). One global embedding table per ID space; all features referencing the same space share the table. But each feature is fed separately into the network.

Why Share
  • 3x+ parameter reduction (critical for memory)
  • Rare entities get gradient from multiple feature contexts
  • Semantic consistency across feature roles
Why Feed Separately
  • Same video plays different roles (impression vs. seed vs. last-watched)
  • Layers above learn role-specific transformations
  • Shared base + separate inputs = best of both worlds

OOV (out-of-vocabulary) values → zero embedding. Vocabulary truncated to top-N by frequency in clicked impressions. Embedding dimension scales with log(cardinality).

Quantile Normalization for Continuous Features

Neural networks are sensitive to input scale. Standard z-score fails for power-law distributed features (view counts, watch times) — outliers compress everything else.

x̃ = F(x) = ∫-∞x p(t) dt (CDF transform → uniform [0,1))

After computing x̃, feed three versions of each continuous feature: , x̃², √x̃. This gives the network cheap access to super-linear and sub-linear relationships without manual feature engineering. Paper result: removing powers increased loss by 0.2%.

Mini-Note 3.3
🔴 Memorize Cold
  • Quantile norm: CDF transform → uniform [0,1)
  • Powers: feed x̃, x̃², √x̃ for every continuous feature
  • OOV → zero embedding (neutral signal)
  • Dim ∝ log(cardinality)
🟢 Derive
  • Why shared embeddings generalize better → gradient from multiple feature contexts
  • Why quantile over z-score → power law distributions; outlier robustness
  • Why powers → cheap non-linear expressivity without more model capacity

Modeling Expected Watch Time via Weighted Logistic Regression

Each training example is either a positive (shown, clicked, watched T_i seconds) or negative (shown, not clicked). The modification: weight positive examples by their observed watch time T_i; keep negative weight = 1.

Positive example (clicked, watched T_i seconds):
  Loss weight = T_i      ← watch time in seconds

Negative example (not clicked):
  Loss weight = 1        ← unit weight

Same cross-entropy loss, same architecture. One change.

What the Model Actually Learns

Working through the math: with watch-time weighting, the learned log-odds are approximately:

Learned odds ≈ E[T] × (1 + P) ≈ E[T] (since click probability P ≪ 1)

The model learns odds that closely approximate expected watch time E[T], not click probability. At serving time, using e^{f(x)} (exponential activation) directly outputs these odds — the quantity you want to rank by.

Empirical Validation

ConfigurationWatch-time Weighted Pairwise Loss
No hidden layers (linear)41.6%
512 → 256 ReLU35.2%
1024 → 512 → 256 ReLU34.6%
Equal weighting (positive = negative = 1)+4.1% vs. watch-time weighted
Elegant insight: This is a poor man's multi-task model — it implicitly learns both P(click) and E[T|click] simultaneously through a single weighted loss, without explicit two-head architecture. Modern systems make this explicit with MTL.
Mini-Note 3.4
🔴 Memorize Cold
  • Positive weight = T_i (seconds watched)
  • Negative weight = 1
  • Serving activation = e^{f(x)}, NOT sigmoid
  • Equal weighting → +4.1% worse pairwise loss
  • Learned odds ≈ E[T] (since P ≪ 1)
🟢 Derive
  • Why watch-time weighting → high T_i positives dominate gradient → push toward E[T]
  • Why exponential at serving → logistic odds = e^f, learned odds ≈ E[T]
  • Why not direct regression → skewed distribution + class imbalance makes classification more stable
Phase 04

Systems Perspective

Training Pipeline

Data Generation Decisions

DecisionChoiceReason
Which watchesALL watches (not rec-driven)Break feedback loop; new content surface
Examples per userFixed cap NEqual loss weighting; prevent power user dominance
Label typeFuture watch (rollback)No future leakage; asymmetric patterns
Training windowSeveral weeksSignal for rare items; recent enough

Different Negatives for Different Stages

Retrieval Negatives
  • Random from corpus
  • Importance weighted
  • Learns: relevant vs. random noise
  • Easy but effective at corpus scale
Ranking Negatives
  • Shown to user but NOT clicked
  • Much harder — passed retrieval filter
  • Learns: good impression vs. plausible-but-bad
  • Critical distinction for ranking quality
Most common silent failure: Training/serving skew — features computed differently at training vs. serving time. Silent degradation, hard to detect. Fix: use the same feature computation code (library) for both training and serving.

Distributed Training

Parameter server architecture with asynchronous SGD. Embedding tables (256M+ params) sharded across parameter servers. Workers fetch only the embedding rows needed per batch. Async SGD trades slightly stale gradients for much faster wall-clock time — the right trade-off at thousands of workers.

Offline vs. Online Evaluation

"Live A/B results are not always correlated with offline experiments." — this is not a minor caveat. It is a fundamental property of recommendation systems.

Four Sources of Divergence

Offline Metric Design Principle

Design offline metrics that mirror the online objective as closely as possible. YouTube's ranking offline metric: watch-time weighted pairwise loss — matches training objective → better correlation with online results.

Metric Hierarchy

LevelMetricsImportanceDifficulty
Long-termDay-N retention, DAU/MAUHighestHardest to measure
SessionSession length, depthHighMedium
EngagementWatch time, completion rateMediumEasy
SatisfactionSurveys, dislike rateHighSparse
Goodhart's Law: Every metric, if optimized hard enough, gets gamed by the model finding shortcuts. Watch time → autoplay traps. CTR → clickbait. Session length → rabbit holes. This is why multi-metric evaluation and human spot-checks are necessary.

Serving Infrastructure

End-to-End Latency Budget

Stage                    Items        Latency    Cost/item
────────────────────────────────────────────────────────
Feature lookup (user)    1            ~5ms       5ms
User embedding DNN       1            ~5ms       5ms
ANN index lookup         1M → 500     ~5ms       0.005ms
Feature lookup (video)   500          ~10ms      0.02ms
Ranking DNN scoring      500          ~20ms      0.04ms
Sort + filter            500          ~1ms       —
────────────────────────────────────────────────────────
Total                                 ~46ms

ANN: 0.005ms/item vs Ranking: 0.04ms/item → 8x more expensive
→ Confirms why ranking DNN can't run at retrieval scale

Feature Store Architecture

The feature store unifies batch (daily aggregates) and real-time (per-event streaming) features behind a single low-latency lookup interface. The critical rule: same feature computation code at training and serving time — prevents training/serving skew.

Caching Strategy

What to CacheTTLReason
User embeddings~5 minutesTaste changes slowly; DNN forward pass saved
Video featuresMinutes to hoursView count, category change slowly
Impression historyDO NOT CACHEMust be fresh; stale = wrong churn behavior
Final ranked listDO NOT CACHEPersonalization must be per-request

Graceful Degradation

Post-processing layer: Between model output and what users see: de-duplication, diversity enforcement, policy filters (age restrictions, geographic licensing), ad insertion. This is rule-based, not ML. Always distinguish model output from final serving output.
Part II — Where Things Stand Today

The 2016 paper established the blueprint. What followed was a decade of refinement — better retrieval models, richer user representations, multi-objective training, and entirely new serving paradigms. This section traces that arc and examines the current state across the industry.

Phase 05

Evolution & State of Art

What Changed at YouTube Post-2016

2016
Single DNN retrieval + Single DNN ranking. Average pooling. Watch time objective. Random negatives. This paper.
2018
Multi-task learning introduced. MMoE (Multi-gate Mixture of Experts) solves conflicting gradient problem across tasks. Multiple objectives blended via weighted sum.
2019
Two-tower model formalized. Sequential attention replaces averaging. Sampling bias correction (subtract log p(v)). In-batch negatives. Position bias correction via shallow tower.
2020
Multimodal content embeddings for cold start. Hard negative mining (iterative). ScaNN published as production ANN algorithm.
2021+
Transformer-based user modeling. RL/bandit for exploration. Real-time feature updates (seconds). Continuous training pipelines.

MMoE — Solving Conflicting Objectives

Watch time alone creates pathological behavior (rabbit holes, autoplay traps). MMoE introduces task-specific gating networks over a mixture of expert networks — each task learns which experts to weight, allowing conflicting tasks to use different expert combinations without interfering.

Final ranking score:
  = w1 × E[watch_time]
  + w2 × P(like)
  + w3 × P(share)
  - w4 × P(dislike)
  - w5 × P(skip)

Weights tuned via A/B testing.
These values are closely guarded at any company.

Industry Evolution: Facebook, TikTok, Twitter

PlatformKey InnovationPrimary ObjectiveUnique Signal
Facebook/MetaDLRM: explicit dot product feature crosses + hybrid CPU-GPU for terabyte embeddingsCTR + meaningful social interactionsSocial graph engagement
TikTokCascade new content expansion; Monolith real-time embedding updatesWatch completion rate (% not seconds)Pure exploration — no social graph required
Twitter/XSimClusters community detection; GNN user modelingEngagement + recencySocial proof (followed accounts' engagement)
YouTube (modern)MMoE multi-task; transformer sequence; ScaNN retrievalMulti-objective blendLong-form watch time + satisfaction surveys

TikTok's Cascade Expansion (Cold Start Innovation)

New video uploaded
  → Shown to small random cohort (~few hundred users)
  → Measure: completion rate, likes, shares
  → If signals good → expand to larger cohort (×10)
  → Cascade until natural audience size reached
  → Poor signals → stays in long tail

Result: every creator gets a fair chance regardless of size.
        Small creator → viral is structurally possible.
        No content features needed — real engagement bootstraps CF.

Key Architecture Differences

Retrieval Evolution: Two-Tower → HNSW → ScaNN

Model Side Evolution

StageUser RepresentationTraining SignalKey Paper
2016Single vector, avg poolingRandom negatives + importance weightingThis paper
2019Symmetric two-towerIn-batch negatives + sampling bias correctionGoogle RecSys 2019
2020Two-tower + hard negativesIterative hard negative miningFacebook, Google
2022+Multi-vector (per interest cluster)Contrastive + MaxSim scoringColBERT-inspired

Index Side Evolution

HNSW: Multi-layer proximity graph. O(log N) query. Greedy navigation: start at sparse top layer, descend to dense bottom layer. Best recall/speed trade-off. High memory (full vectors in RAM). Industry standard via hnswlib, Faiss, Pinecone, Qdrant.

ScaNN (Google 2020): Anisotropic quantization — minimizes error specifically in the direction that affects dot product (MIPS), not uniform reconstruction error. 2x faster than HNSW at same recall. Production system at Google for YouTube retrieval.

Standard PQ:  minimize ||v - v̂||²  (uniform, direction-agnostic)
ScaNN:        minimize error in direction affecting u·v  
              → directionally aware → better MIPS accuracy per bit

Sampling bias correction (2019): Popular items appear too often as negatives → model under-recommends them. Fix: corrected_score = u·v − log(p(v)) where p(v) is item frequency in training data.

Ranking Evolution: DCN, Feature Crosses, Transformers

The Feature Interaction Problem

Deep MLPs can learn interactions but do so inefficiently — no inductive bias toward feature crosses, need enormous capacity, interactions are combinatorially explosive. This motivated explicit interaction modeling.

ModelKey InnovationYear
Wide & DeepManual cross + MLP; memorization + generalizationGoogle 2016
DeepFMFM replaces manual wide; shared embeddings; automatic crossesHuawei 2017
DCNCross network: explicit polynomial feature interactions without combinatorial explosionGoogle 2017
DINCandidate-aware attention over user history; user vector adapts per candidateAlibaba 2018
DCN v2Full matrix cross layer (vs. rank-1); richer interactionsGoogle 2021
SASRecCausal transformer for sequential recommendation; O(n²) attention2018

DCN Cross Network (Key Formula)

xl+1 = x0 · (wl · xl) + bl + xl After L layers → degree L+1 feature interactions

DIN: Candidate-Aware User Representation

Standard (YouTube 2016):
  User vector = avg(all history embeddings)
  → SAME vector regardless of candidate being scored

DIN:
  For candidate item c:
    a_i = attention(history_item_i, candidate_c)
    user_vector = Σ a_i × history_item_i_embedding
  → DIFFERENT user vector per candidate
  → Cooking video candidate → cooking history items get high weight
  → Sports video candidate → same history, different weights

Three-Stage Funnel (Modern Standard)

500 candidates from retrieval
      ↓
[Light Ranker: simple features, fast model ~1ms total]
      ↓
50 candidates
      ↓
[Heavy Ranker: transformer history, DCN v2, MTL ~20ms total]
      ↓
Final top-N

State of Art Today: Sequential Models & LLMs

Long Sequence Handling

Users on mature platforms have 50,000–100,000+ interaction histories. Quadratic transformer attention is infeasible. Three approaches:

LLMs in RecSys: Honest Assessment

Use CaseStatusWhy
LLM text embeddings as ranking features✅ Production deployedOffline, cached; cheap at serving; powerful for cold start
LLM for query understanding / intent✅ Production deployedNatural language → structured query; works well
Conversational recommendation✅ Early production (Rufus, YouTube AI)Multi-turn preference elicitation; genuinely new capability
LLM as direct ranker🔬 Research only500ms–2s latency; $0.01–0.10 per request — infeasible at scale
LLM replacing collaborative filtering❌ Not workingCF captures revealed preference; LLM captures semantic similarity — weaker signal
Why LLMs can't replace CF at scale: CF captures revealed preference (what 1M users actually watched). LLM captures semantic similarity (X and Y are about similar topics). Revealed preference is a stronger engagement signal than semantic similarity. "1M users who watched X also watched Y" >> "X and Y are semantically similar."

The state-of-art production architecture: LLM-augmented two-stage pipeline. LLM handles cold start (text embeddings), query understanding, explanation generation, and conversational refinement. The two-stage funnel remains the backbone.

Phase 06

Design Principles

System Design Frameworks

Requirements Scoping (Always Do This First)

Scale:      How many users? Items? RPS? Latency budget?
Objective:  Watch time? CTR? Multi-objective? Guardrails?
Context:    Homepage vs. watch-next? Long-form vs. short-form?
Cold start: New users? New items? How frequent?

The Answer Structure That Works

Strong Answers for Common Probes

QuestionKey Points to Hit
Why two stages?Compute budget + feature availability at retrieval time + source blending. Derive from scale constraint.
New video cold start?Content embeddings (LLM) + freshness retrieval channel + cascade expansion + example age feature
New user cold start?Demographic priors → exploration to discover interests → CF takes over as history accumulates
Why watch time not CTR?CTR = thumbnail quality. Watch time = video quality. Implemented via T_i-weighted logistic regression → odds ≈ E[T]
How to evaluate?Offline: custom metric mirroring objective. Online A/B: watch time primary. Long-term: retention. Never conflate offline with online.
Failure modes?Training/serving skew (silent, dangerous), feedback loop, popularity bias, embedding staleness, distribution shift
How to improve?Hard negatives → sequential attention → multi-task MTL → cold start improvements → DCN v2 → three-stage funnel

The Meta-Skill: Framing Every Answer

1. State the problem first, then the solution. 2. Always state the trade-off. 3. Connect to business metrics. 4. Know the numbers. 5. Acknowledge what you'd measure before shipping.

Edge Cases & Common Misconceptions

Gotcha QuestionWhat the Interviewer is TestingCore Insight to Demonstrate
"What exactly is example age?"Whether you confuse it with video aget_max − t_watch (watch event age, NOT video age). Corrects sampling bias in training data.
"Why withhold search query?"Understanding surrogate problem gamingIncluding last search → model reproduces search results page on homepage. Unordered bag of tokens breaks the connection.
"Why rollback labels?"Understanding future leakageRandom holdout sees future events + ignores asymmetric consumption. Rollback mirrors serving conditions exactly.
"Why per-user capping?"Understanding loss function dynamicsPower law user distribution → power users dominate gradients without capping. Cap = equal loss weighting across users.
"Why not hierarchical softmax?"Deep understanding of the rejectionNo natural video hierarchy. Arbitrary tree splits make each binary decision harder. Sampled softmax avoids artificial structure.
"Walk me through the weighted LR math"Can you derive learned odds ≈ E[T]?T_i weights → learned log-odds ≈ Σ T_i / (N-k) ≈ E[T] × (1+P) ≈ E[T] since P ≪ 1
"Why train on all watches?"Understanding feedback loopsRec-driven only → closed loop → model reinforces own biases → new content starved → never discovered
"Train/serve asymmetry in retrieval?"Whether you notice this subtletyTrained with sampled softmax; served with ANN dot product. Works because softmax IS dot product optimization. Modern fix: contrastive loss + hard negatives.
"What's wrong with this paper?"Intellectual honesty and depthSingle objective (watch time gaming), averaging (sequence order lost), train/serve asymmetry, no feedback loop/fairness discussion, offline metrics underspecified

What to Memorize Cold vs. What to Derive

Tier 1 — Numbers & Formulas
Memorize These Exactly
Candidate generation:
  Embedding dim: 256 | Bag size: 50 | Vocab: 1M videos + 1M search tokens
  Best depth: 1024 → 512 → 256 ReLU | Negatives: ~thousands (100x speedup)

Ranking:
  Best config: 1024 → 512 → 256 ReLU | Pairwise loss: 34.6%
  Equal weighting: +4.1% worse | No hidden layers: 41.6%

Key formulas:
  example_age = t_max − t_watch  (set to 0 at serving)
  Positive weight = T_i (seconds) | Negative weight = 1
  Learned odds ≈ E[T] × (1+P) ≈ E[T]  (since P ≪ 1)
  Serving activation = e^{f(x)}  NOT sigmoid
  Quantile norm: x̃ = F(x) = CDF transform → [0,1)
  Powers: feed x̃, x̃², √x̃ for every continuous feature
Tier 2 — Named Techniques
NameOne-line Description
Two-towerSeparate user + item encoders; score = dot product u·v
MMoETask-specific gating over mixture of experts; handles conflicting objectives
DCN v2Cross network with full matrix; explicit polynomial feature interactions
DINAttention over user history weighted by relevance to candidate
SASRecCausal transformer for sequential recommendation; O(n²) attention
HNSWMulti-layer proximity graph; O(log N) ANN; high memory
ScaNNAnisotropic quantization for MIPS; 2x faster than HNSW; Google production
TWINTwo-stage inside user modeling: retrieve relevant history → attend (ByteDance)
MonolithReal-time embedding updates via collision-free hash map; no batch training boundary
SimClustersCommunity detection over social graph → sparse user representation (Twitter)
DLRMExplicit dot product feature crosses + hybrid CPU-GPU for terabyte embeddings (Facebook)
Sampled softmaxTrain over sampled negatives; 100x speedup; importance weighted
In-batch negativesOther batch items as negatives; efficient + harder than random
Tier 3 — Non-Obvious Design Choices
Cannot Be Derived — Must Know
  • Train on ALL watches, not rec-driven only → breaks feedback loop
  • Per-user example capping (fixed N) → equal loss weighting across users
  • Rollback label (future watch, not random holdout) → no future leakage
  • Video embeddings = softmax output layer weights (same 256-dim as user)
  • Shared embedding tables, separate feature inputs → efficiency + role specialization
  • Quantile normalization (CDF), not z-score → handles power law distributions
  • Powers of continuous features: x̃, x̃², √x̃
  • Serving activation = e^{f(x)}, not σ(f(x))
  • Example age = 0 at serving (not at training)
  • Search queries as unordered bag of tokens (no sequence, no recency)
  • OOV → zero embedding (neutral, uninformative)
  • Retrieval negatives = random; Ranking negatives = shown but not clicked
Tier 4 — Derive From First Principles
QuestionDerivation Path
Why two stages?Scale → can't afford rich computation per item at millions → cheap broad retrieval → expensive precise ranking on small set
Why ANN?Denominator doesn't affect ranking → MIPS problem → exact O(|V|) infeasible → approximate with tolerable accuracy loss → ranking corrects errors
Why implicit feedback?Explicit ratings sparse → can't train tail content → implicit (completions) 1000x more data → volume wins despite noise
Why watch time not CTR?CTR → clickbait problem → CTR = thumbnail quality → watch time = video quality → harder to game → better satisfaction proxy
Why independent scoring?Joint scoring = exponential complexity → independent = embarrassingly parallel → batch all candidates → fast serving
Why shared embeddings?Same entity in multiple features → separate tables = redundant + sparse training → shared = 3x reduction + more gradient per entity
Why quantile normalization?Power law distributions → z-score fails → outliers compress everything → CDF maps any distribution to uniform [0,1) → well-conditioned inputs

The Five Connecting Patterns

Pattern 1: Scale is the master constraint — it determines everything else. At millions of items you can't have rich features. At hundreds you can.
Pattern 2: Data fixes often beat model fixes. Example age, rollback labels, per-user capping, all-watches policy — all data pipeline decisions with outsized impact.
Pattern 3: Every design choice is a cost/benefit trade-off. Always state both sides: averaging vs. RNN, random vs. hard negatives, shared vs. separate embeddings, single vs. multi-task.
Pattern 4: The surrogate chain has many links. True objective → surrogate metric → training label → model output → offline metric → online metric → true objective. Every link is an approximation. Know where the failures hide.
Pattern 5: Offline metrics ≠ online results, fundamentally. Offline is for iteration speed. A/B is ground truth. Never optimize offline metrics in a vacuum.

Quick Reference

"Design YouTube/TikTok recommendation"
Two-stage funnel → retrieval (two-tower, ANN) → ranking (DCN+DIN, multi-task) → training (rollback, all watches, per-user cap) → evaluation (custom offline + A/B watch time)
"Why two stages?"
Compute budget + feature availability (don't know item at retrieval time) + source blending. Derive from: scale → can't afford rich computation → stage decomposition.
"New video cold start?"
LLM text embeddings (available immediately) + separate freshness retrieval channel + cascade expansion (TikTok-style small cohort → measure → expand) + example age feature
"New user cold start?"
Demographic priors (zero vector → graceful fallback) → exploration to discover interests → CF takes over. Onboarding questions accelerate Phase 1→2.
"Why watch time not CTR?"
CTR = thumbnail quality (gameable by clickbait). Watch time = video quality (can't fake 8 minutes). Implemented via T_i-weighted logistic regression → odds ≈ E[T]. e^f(x) activation at serving.
"How to keep fresh?"
example_age = t_max − t_watch (set 0 at serving) + real-time feature store (Kafka → Flink → Redis) + continuous training + separate freshness retrieval channel
"How to evaluate?"
Offline: custom metric mirroring objective (watch-time pairwise loss). Online A/B: watch time primary, retention long-term. Offline ≠ online — trust A/B for final decisions.
"What can go wrong?"
Training/serving skew (silent, most dangerous) + feedback loop (rec-driven training only) + popularity bias (sampling bias correction) + embedding staleness + distribution shift
"How to improve this system?"
Hard negatives → DIN/transformer user modeling → MMoE multi-task → cold start improvements → DCN v2 feature interactions → three-stage funnel → position bias correction
"What's wrong with this paper?"
Single watch time objective (gameable) + averaging loses sequence + train/serve asymmetry + no feedback loop/fairness discussion + offline metrics underspecified