RLHF

Optimising what people prefer, with a leash to stop the cheating

01 · First principles · Likelihood is not preference

Supervised fine-tuning (SFT) maximises the probability of demonstrations, and that objective has a ceiling built in: the model can at best become its demonstrators, hedges and errors included. Worse, likelihood cannot represent the judgment we actually care about. Between two valid answers, one rambling and one clear, SFT has no slot for "humans prefer the second"; both are just sequences with probabilities.

The missing signal is comparative. And there is a practical asymmetry that makes it cheap to collect: people are bad at writing ideal answers but quick and reasonably consistent at ranking two given ones (verifying is easier than generating). RLHF (reinforcement learning from human feedback) is the machinery for converting that weak, plentiful signal (A is better than B) into a training objective.

02 · Step one · A reward model from pairwise comparisons

Comparisons are not numbers, and optimisation needs numbers. The bridge is the Bradley–Terry model (1952, a statistical model for paired comparisons and a close relative of the Elo ratings used in chess): assume each response y to a prompt x has a latent score r(x, y), and that humans prefer the higher-scored response with probability given by the score gap squashed through a sigmoid:

P(y_w ≻ y_l) = σ( r(x, y_w) − r(x, y_l) )

Here r(x, y_w) is the score of the preferred response and r(x, y_l) is the score of the rejected response.

Fit r (an LLM with a scalar head, initialised from the SFT model) by maximum likelihood on the human comparisons, typically tens to hundreds of thousands of them. Only score differences are identified (the absolute scale is arbitrary, like Elo), which is fine, since only differences will ever be used. The result is a learned, differentiable proxy for human judgment that can score any number of new responses at the cost of a forward pass each, with no further human labelling.
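To make the fitting step concrete, here is a minimal sketch of Bradley–Terry maximum likelihood on a toy problem. It assumes three candidate responses with one free score each instead of an LLM with a scalar head, and the comparison counts, the function names and the learning rate are all invented for illustration.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def bt_nll(r_w: float, r_l: float) -> float:
    """Negative log-likelihood of one comparison: -log sigma(r_w - r_l)."""
    return -math.log(sigmoid(r_w - r_l))

def fit_scores(comparisons, n_items, lr=0.5, steps=2000):
    """Gradient descent on the mean Bradley-Terry loss.

    comparisons: list of (winner_index, loser_index) pairs.
    Returns one latent score per item.
    """
    scores = [0.0] * n_items
    for _ in range(steps):
        grads = [0.0] * n_items
        for w, l in comparisons:
            p = sigmoid(scores[w] - scores[l])  # model's P(w beats l)
            grads[w] -= 1.0 - p                 # d loss / d r_w
            grads[l] += 1.0 - p                 # d loss / d r_l
        scores = [s - lr * g / len(comparisons) for s, g in zip(scores, grads)]
    return scores

# Hypothetical data: response 0 usually beats 1, and 1 usually beats 2.
comparisons = (
    [(0, 1)] * 8 + [(1, 0)] * 2
    + [(1, 2)] * 7 + [(2, 1)] * 3
    + [(0, 2)] * 9 + [(2, 0)] * 1
)
scores = fit_scores(comparisons, n_items=3)
print([round(s, 3) for s in scores])
```

Each comparison's gradient adds to one score exactly what it subtracts from the other, so the mean score never moves from its starting value: the data pin down only the gaps between scores. Adding the same constant to every score leaves `bt_nll` unchanged for the same reason.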

03 · Step two · PPO, and why the KL leash is the load-bearing part

Now optimise the policy to produce high-reward responses. The naive objective — maximise E[r] alone — fails in a characteristic way: the reward model is a proxy, accurate near the data it was trained on and increasingly wrong far from it, and an optimiser is a machine for finding exactly the inputs where its objective is most wrong. Push hard enough and the policy discovers adversarial responses the reward model loves and humans do not — obsequious boilerplate, confident padding, eventually degenerate text. This is reward hacking, and it arrives reliably, not occasionally. The fix is to penalise distance from the SFT policy:

maximise E_y∼π[ r(x, y) ] − β · KL( π(·|x) ‖ π_SFT(·|x) )

The first term is the proxy reward; the second is the leash, which keeps the policy near the SFT policy.

The KL term confines the search to the region where the reward model is trustworthy, and it also fights mode collapse — without it, the policy funnels toward the single highest-scoring phrasing and diversity dies. PPO is the particular RL algorithm used to climb this objective stably (clipped updates so the policy never moves too far per step); the conceptual content is the leash, not the climber.
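The effect of the leash can be checked by hand on a toy problem. The sketch below assumes a single prompt with four candidate responses, so the expectation and the KL term can be computed exactly; the probabilities, the rewards and the policy names are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same responses."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def objective(policy, ref, reward, beta):
    """Expected proxy reward minus beta times the KL to the reference policy."""
    expected_reward = sum(p * r for p, r in zip(policy, reward))
    return expected_reward - beta * kl(policy, ref)

# Hypothetical toy setup. Response 3 is an "exploit": rare under the SFT
# policy, but badly over-scored by the proxy reward model.
ref = [0.40, 0.30, 0.29, 0.01]
reward = [1.0, 1.2, 0.8, 5.0]

collapsed = [0.001, 0.001, 0.001, 0.997]  # all mass on the exploit
nudged = [0.38, 0.36, 0.24, 0.02]         # small shift toward higher reward

for beta in (0.0, 1.0):
    print(
        f"beta={beta}:",
        "sft", round(objective(ref, ref, reward, beta), 3),
        "nudged", round(objective(nudged, ref, reward, beta), 3),
        "collapsed", round(objective(collapsed, ref, reward, beta), 3),
    )
```

With beta set to 0 the collapsed policy wins by a wide margin. With beta set to 1 it scores below the unchanged SFT policy, while the small nudge still scores above it: the penalty prices in how far the policy has travelled, which is the behaviour the paragraph above describes.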

The PPO loop. Keeping four models in memory (policy, reference, reward model, value function) is part of why this pipeline is operationally heavy.
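As a rough sketch of where the four models meet, the two functions below show a per-token reward with the KL penalty folded in, and the PPO clipped surrogate for a single token. The placement of the reward-model score on the final token is one common arrangement and an assumption here, not something the text specifies. The names `shaped_rewards` and `clipped_surrogate`, the clip range of 0.2 and the sample numbers are likewise illustrative, and the advantage is taken as given (in practice it would come from the value function).

```python
import math

def shaped_rewards(rm_score, logp_policy, logp_ref, beta):
    """Per-token rewards for one sampled response (assumed non-empty).

    Every token pays a KL-style penalty, beta * (log pi - log pi_ref);
    the reward model's scalar score is added on the last token.
    """
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped objective for one token (to be maximised)."""
    ratio = math.exp(logp_new - logp_old)           # pi_new / pi_old
    clipped = max(1.0 - eps, min(1.0 + eps, ratio)) # keep ratio in [1-eps, 1+eps]
    return min(ratio * advantage, clipped * advantage)

# Policy log-probs come from the policy, reference log-probs from the frozen
# SFT copy, rm_score from the reward model.
print(shaped_rewards(2.0, [-1.0, -2.0], [-1.5, -2.0], beta=0.1))

# A good token (advantage > 0) gains nothing once the ratio exceeds 1 + eps.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))
```

Taking the minimum of the clipped and unclipped terms makes the objective pessimistic: moving the ratio far outside the clip range is never rewarded, which is what keeps each update small.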

04 · The shortcut · DPO: skip the reward model

The KL-constrained objective has a known closed-form optimum: π*(y|x) ∝ π_SFT(y|x) · exp( r(x, y) / β ). Read backwards, this says any policy implicitly defines a reward, r(x, y) = β log( π(y|x) / π_SFT(y|x) ) + c(x), where c(x) = β log Z(x) comes from the normalising constant and depends only on the prompt. Substitute that into the Bradley–Terry likelihood: c(x) cancels in the score difference, and the reward model drops out of the pipeline entirely, leaving a simple classification-style loss on preference pairs:

L_DPO = −log σ( β·log [π(y_w|x)/π_SFT(y_w|x)] − β·log [π(y_l|x)/π_SFT(y_l|x)] )

The first term raises the preferred response relative to the anchor; the second lowers the rejected response relative to the anchor.
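The cancellation behind this loss can be verified numerically. The sketch below assumes a toy prompt with three candidate responses, so the closed-form optimum, with the exponent read as exp(r/β), can be normalised exactly; the probabilities, the rewards and the function names are hypothetical.

```python
import math

def optimal_policy(ref, reward, beta):
    """Closed-form optimum: pi*(y) proportional to ref(y) * exp(reward(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref, reward)]
    z = sum(weights)  # normalising constant Z(x)
    return [w / z for w in weights]

def log_sigmoid(z):
    """log sigma(z), computed without overflow."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """DPO loss for one pair, from sequence log-probs under policy and anchor."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)

ref = [0.5, 0.3, 0.2]     # SFT policy over three responses
reward = [1.0, 2.0, 0.0]  # true latent scores
beta = 0.5
pi_star = optimal_policy(ref, reward, beta)

# Preferred response w = 1, rejected response l = 0.
w, l = 1, 0
margin = beta * (math.log(pi_star[w] / ref[w]) - math.log(pi_star[l] / ref[l]))
print(margin, reward[w] - reward[l])  # equal up to rounding: log Z cancels

loss_at_ref = dpo_loss(math.log(ref[w]), math.log(ref[l]),
                       math.log(ref[w]), math.log(ref[l]), beta)
loss_at_opt = dpo_loss(math.log(pi_star[w]), math.log(pi_star[l]),
                       math.log(ref[w]), math.log(ref[l]), beta)
print(loss_at_ref, loss_at_opt)
```

At the optimum the implied margin equals the true reward gap, so the DPO loss coincides with the Bradley–Terry negative log-likelihood for that pair. A policy identical to the anchor sits at log 2, the loss of a coin flip. For real models the log-probabilities would be sums of token log-probabilities over each response.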

No reward model, no sampling during training, no RL machinery: just gradient descent on pairs. The tradeoff: DPO only ever sees the fixed preference dataset, while PPO samples from the current policy and has those samples scored, which matters as the policy drifts from the data. In practice DPO (and its variants) dominates open-source alignment on cost-effectiveness; the heavyweight online pipelines persist at the frontier, increasingly with verifiable rewards (does the code pass tests?) replacing the learned proxy where tasks allow. That removes the learned proxy's blind spots at the source, although the reward is then only as strong as the checker.
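A verifiable reward in the "does the code pass tests?" sense can be sketched in a few lines. This is an illustration only: the function name, the test-case format and the candidate programs are hypothetical, and a real pipeline would run untrusted model output inside a sandbox, never with a bare `exec`.

```python
def verifiable_reward(candidate_source, func_name, test_cases):
    """Return 1.0 if the candidate defines func_name and passes every test.

    test_cases: list of (args_tuple, expected_result) pairs.
    Any exception (syntax error, missing function, crash) scores 0.0.
    WARNING: exec on untrusted code is unsafe outside a sandbox.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)
        func = namespace[func_name]
        passed = all(func(*args) == expected for args, expected in test_cases)
        return 1.0 if passed else 0.0
    except Exception:
        return 0.0

cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
broken = "def add(a, b) return a + b"

print(verifiable_reward(good, "add", cases))    # passes every case
print(verifiable_reward(bad, "add", cases))     # fails on (1, 2)
print(verifiable_reward(broken, "add", cases))  # does not parse
```

The score comes from execution, not from a learned model, so there is no proxy to drift away from. It is still only as strong as the test cases: a candidate that hard-codes the expected outputs for a thin suite would also score 1.0.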

05 · Caveats · What the optimum optimises

RLHF makes the model better at being preferred, and human raters have exploitable habits: they reward confidence, length and agreement. Sycophancy and authoritative hedging are not bugs in the pipeline but faithful optimisation of the signal it was given. The method is only as good as the preferences, and the preferences are only as good as the attention of the people providing them — the alignment problem, recursed one level down.

Mental Model

SFT can only imitate; preference is comparative information that likelihood has no slot for. Ranking is also far cheaper to collect than writing.
Bradley–Terry turns pairwise comparisons into a scalar reward: P(w ≻ l) = σ(r_w − r_l).
The reward model is a proxy; unconstrained optimisers find its blind spots (reward hacking). The KL term to the SFT policy is the leash, and it also prevents mode collapse.
DPO: the KL-constrained optimum in closed form turns RLHF into a classification loss on pairs — same objective, no RL loop, but no exploration either.
The model becomes what the raters reward, including their biases toward confidence and flattery.