Optimising what people prefer, with a leash to stop the cheating
SFT maximises the probability of demonstrations, and that objective has a ceiling built in: the model can at best become its demonstrators, hedges and errors included. Worse, likelihood cannot represent the judgment we actually care about. Between two valid answers — one rambling, one clear — SFT has no slot for "humans prefer the second"; both are just sequences with probabilities.
The missing signal is comparative. And there is a practical asymmetry that makes it cheap to collect: people are bad at writing ideal answers but quick and reasonably consistent at ranking two given ones (verifying is easier than generating). RLHF is the machinery for converting that weak, plentiful signal — A is better than B — into a training objective.
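Comparisons are not numbers, and optimisation needs numbers. The bridge is the Bradley–Terry model (1952, developed for paired-comparison experiments; Elo chess ratings rest on the same idea): assume each response has a latent score r(x, y), and that humans prefer the higher-scored response with probability given by the score gap squashed through a sigmoid. Writing y_w for the preferred response and y_l for the rejected one:
$$P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$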
Fit r — an LLM with a scalar head, initialised from the SFT model — by maximum likelihood on a few hundred thousand human comparisons. Only score differences are identified (the absolute scale is arbitrary, like Elo), which is fine, since only differences will ever be used. The result is a learned, differentiable proxy for human judgment that can score unlimited new responses for free.
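As a minimal sketch of the fitting step, assume a toy setting in which each of three responses gets its own free scalar score, rather than the output of an LLM with a scalar head. The Bradley–Terry negative log-likelihood can then be minimised by plain gradient descent. The comparison data and the names `bt_nll` and `fit_scores` are hypothetical, chosen only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bt_nll(scores, comparisons):
    """Mean negative log-likelihood of (winner, loser) index pairs under Bradley-Terry."""
    total = 0.0
    for winner, loser in comparisons:
        total -= math.log(sigmoid(scores[winner] - scores[loser]))
    return total / len(comparisons)


def fit_scores(n_items, comparisons, lr=0.5, steps=2000):
    """Fit one free scalar score per item by gradient descent on the mean NLL."""
    scores = [0.0] * n_items
    for _ in range(steps):
        grad = [0.0] * n_items
        for winner, loser in comparisons:
            # Derivative of -log(sigmoid(d)) with respect to d is -(1 - sigmoid(d)).
            p_wrong = 1.0 - sigmoid(scores[winner] - scores[loser])
            grad[winner] -= p_wrong
            grad[loser] += p_wrong
        scores = [s - lr * g / len(comparisons) for s, g in zip(scores, grad)]
    return scores


# Hypothetical, deliberately noisy comparisons between three responses (0, 1, 2).
comparisons = (
    [(0, 1)] * 3 + [(1, 0)]
    + [(1, 2)] * 3 + [(2, 1)]
    + [(0, 2)] * 4 + [(2, 0)]
)

fitted = fit_scores(3, comparisons)
shifted = [s + 5.0 for s in fitted]  # same score gaps, different absolute scale
```

Adding the same constant to every score, as in `shifted`, leaves `bt_nll` unchanged. That is the sense in which only score differences are identified.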
Now optimise the policy to produce high-reward responses. The naive objective, maximising E[r] alone, fails in a characteristic way: the reward model is a proxy, accurate near the data it was trained on and increasingly wrong far from it, and an optimiser is a machine for finding exactly the inputs where its objective is most wrong. Push hard enough and the policy discovers adversarial responses the reward model loves and humans do not: obsequious boilerplate, confident padding, eventually degenerate text. This is reward hacking, and it arrives reliably, not occasionally. The fix is to penalise distance from the SFT policy, with a coefficient β > 0 setting the strength of the penalty:
$$\max_{\pi} \; \mathbb{E}_{x,\, y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\, \mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{SFT}}(\cdot \mid x)\big)$$
The KL term confines the search to the region where the reward model is trustworthy, and it also fights mode collapse — without it, the policy funnels toward the single highest-scoring phrasing and diversity dies. PPO is the particular RL algorithm used to climb this objective stably (clipped updates so the policy never moves too far per step); the conceptual content is the leash, not the climber.
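The effect of the leash can be illustrated on a toy problem. The sketch below assumes a single prompt with four candidate responses, so that a policy is just a list of four probabilities. All numbers are hypothetical, and the last response plays the role of a rare output that the proxy reward overrates.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def objective(policy, reference, rewards, beta):
    """Expected proxy reward minus beta times the KL distance from the reference."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl(policy, reference)


# Hypothetical toy problem: probabilities over four candidate responses.
reference = [0.50, 0.30, 0.15, 0.05]  # stands in for the SFT policy
rewards = [1.0, 1.2, 0.8, 3.0]        # proxy reward; the last response is overrated

collapsed = [0.0, 0.0, 0.0, 1.0]      # all mass on the top-scoring response
moderate = [0.45, 0.32, 0.12, 0.11]   # small shift toward higher reward

tight = {
    "reference": objective(reference, reference, rewards, beta=1.0),
    "moderate": objective(moderate, reference, rewards, beta=1.0),
    "collapsed": objective(collapsed, reference, rewards, beta=1.0),
}
loose = {
    "moderate": objective(moderate, reference, rewards, beta=0.1),
    "collapsed": objective(collapsed, reference, rewards, beta=0.1),
}
```

With β = 1 the collapsed policy scores below the SFT policy itself, while the moderate shift scores above it. With β = 0.1 the penalty is too weak to matter and the collapsed policy wins.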
Figure: the PPO loop. Holding four model copies in memory (policy, reference, reward model, value function) is part of why this pipeline is operationally heavy.
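The clipped update mentioned above can be sketched for a single sample. This is the standard PPO clipped surrogate written as a plain function, not a full training loop. The name `clipped_surrogate` and the default `eps=0.2` are illustrative assumptions.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped objective, to be maximised.

    ratio     : pi_new(y|x) / pi_old(y|x) for the sampled response
    advantage : estimated advantage of that sample (from the value function)
    eps       : clip range; 0.2 is an illustrative default, not a recommendation
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    # Taking the minimum makes the objective pessimistic: there is no extra
    # credit for pushing the ratio beyond the clip range in the helpful direction.
    return min(ratio * advantage, clipped_ratio * advantage)


gain_capped = clipped_surrogate(1.5, 2.0)      # good sample, ratio already too high
loss_floored = clipped_surrogate(0.5, -1.0)    # bad sample, ratio already too low
loss_unclipped = clipped_surrogate(1.5, -1.0)  # bad sample made more likely
```

Once the ratio leaves the interval [1 − eps, 1 + eps] in the direction the advantage favours, the surrogate goes flat and the gradient for that sample vanishes. This is what keeps any single step from moving the policy too far.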
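The KL-constrained objective has a known closed-form optimum: $\pi^*(y \mid x) \propto \pi_{\mathrm{SFT}}(y \mid x)\, \exp\big(r(x, y)/\beta\big)$. Read backwards, this says any policy implicitly defines a reward, $r(x, y) = \beta \log\big(\pi(y \mid x)/\pi_{\mathrm{SFT}}(y \mid x)\big) + \text{const}$, where the constant depends only on the prompt x. Substitute that into the Bradley–Terry likelihood, where both responses share the same prompt and the constant therefore cancels, and the reward model drops out of the pipeline entirely, leaving a simple classification-style loss on preference pairs (y_w preferred over y_l). This is Direct Preference Optimisation (DPO):
$$\mathcal{L}_{\mathrm{DPO}}(\pi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi(y_w \mid x)}{\pi_{\mathrm{SFT}}(y_w \mid x)} - \beta \log \frac{\pi(y_l \mid x)}{\pi_{\mathrm{SFT}}(y_l \mid x)}\right)\right]$$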
No reward model, no sampling during training, no RL machinery — just gradient descent on pairs. The tradeoff: DPO only ever sees the fixed preference dataset, while PPO explores its own samples and gets graded on them, which matters as the policy drifts from the data. In practice DPO (and its variants) dominates open-source alignment on cost-effectiveness; the heavyweight online pipelines persist at the frontier — increasingly with verifiable rewards (does the code pass tests?) replacing the learned proxy where tasks allow, which deletes the reward-hacking problem at its source.
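A minimal sketch of the DPO loss follows, assuming the sequence log-probabilities under the policy and the SFT reference have already been computed and summed over tokens. The numbers and the names `implicit_reward` and `dpo_loss` are hypothetical. The last part checks, on a toy three-response problem, that the KL-tilted optimum recovers reward differences through β log(π/πSFT).

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def implicit_reward(policy_logp, ref_logp, beta):
    """beta * log(pi / pi_SFT): the reward a policy implies, up to a per-prompt constant."""
    return beta * (policy_logp - ref_logp)


def dpo_loss(pairs, beta):
    """Mean DPO loss over preference pairs.

    Each pair is (policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l),
    where w is the preferred response and l the rejected one.
    """
    total = 0.0
    for pol_w, pol_l, ref_w, ref_l in pairs:
        margin = implicit_reward(pol_w, ref_w, beta) - implicit_reward(pol_l, ref_l, beta)
        total -= log_sigmoid(margin)
    return total / len(pairs)


beta = 0.1

# Hypothetical log-probabilities for two preference pairs.
at_reference = [(-12.0, -11.0, -12.0, -11.0), (-8.5, -9.0, -8.5, -9.0)]
after_update = [(-10.0, -13.0, -12.0, -11.0), (-7.5, -10.0, -8.5, -9.0)]

loss_at_reference = dpo_loss(at_reference, beta)  # policy equals reference: log 2
loss_after_update = dpo_loss(after_update, beta)  # preferred responses made likelier

# Toy check of the closed form: tilt a reference distribution by exp(r / beta).
toy_beta = 0.5
ref_probs = [0.6, 0.3, 0.1]
toy_rewards = [0.0, 1.0, 2.0]
weights = [p * math.exp(r / toy_beta) for p, r in zip(ref_probs, toy_rewards)]
pi_star = [w / sum(weights) for w in weights]
recovered = [
    implicit_reward(math.log(p), math.log(q), toy_beta)
    for p, q in zip(pi_star, ref_probs)
]
# recovered[i] - recovered[j] matches toy_rewards[i] - toy_rewards[j]:
# the normalising constant cancels in every difference.
```

When the policy equals the reference, every margin is zero and the loss is log 2, the value for a coin-flip classifier. Training lowers it by widening the gap between the implicit rewards of preferred and rejected responses.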
RLHF makes the model better at being preferred, and human raters have exploitable habits: they reward confidence, length and agreement. Sycophancy and authoritative hedging are not bugs in the pipeline but faithful optimisation of the signal it was given. The method is only as good as the preferences, and the preferences are only as good as the attention of the people providing them — the alignment problem, recursed one level down.