General ML

KL Divergence

The price of believing q when the truth is p

01 · First principles

How wrong is a wrong distribution?

From the entropy note: a code built for q assigns each event a length of −log q(x). If the data actually comes from p, the average overspend — extra bits per symbol, beyond the irreducible H(p) — is:

KL(p ∥ q) = E_x∼p[ log p(x) − log q(x) ] = Σ_x p(x) log p(x)/q(x)

expected extra bits for using q's codebook on p's data

Read the structure carefully: the log-ratio measures pointwise wrongness, and the expectation is taken under p — the truth decides which mistakes matter. Wherever p puts no mass, q can be arbitrarily wrong for free; wherever p puts mass and q puts nearly none, log p/q explodes. (That explosion is not a bug; it is the whole personality of the measure, as we will see.)
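The overspend reading can be checked by hand on a small example. The sketch below is illustrative only: the three-outcome distributions `p` and `q` and the helper names are assumptions, not part of the entropy note. It computes KL(p ∥ q) in bits and compares it with cross-entropy minus entropy.

```python
import math

def entropy_bits(p):
    """H(p) in bits; outcomes with p(x) = 0 contribute nothing."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy_bits(p, q):
    """Average code length in bits when p's data is coded with q's codebook."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue              # never observed, so its code length is never paid
        if qx == 0:
            return math.inf       # q has no finite codeword for an event that occurs
        total -= px * math.log2(qx)
    return total

def kl_bits(p, q):
    """KL(p || q) in bits: the expectation under p of log2 p(x)/q(x)."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue              # p puts no mass here, so q's error is free
        if qx == 0:
            return math.inf       # p has mass where q has none
        total += px * math.log2(px / qx)
    return total

p = [0.5, 0.25, 0.25]   # hypothetical true distribution
q = [0.25, 0.25, 0.5]   # hypothetical model

extra = kl_bits(p, q)
overspend = cross_entropy_bits(p, q) - entropy_bits(p)
print(extra, overspend)
```

Both quantities come out to 0.25 bits here: H(p) is 1.5 bits, coding with q costs 1.75 bits, and the difference is the KL term.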

02 · The guarantee

KL ≥ 0, in two lines

log is concave, so Jensen's inequality says E[log Z] ≤ log E[Z]. Apply it to Z = q/p under p:

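−KL(p ∥ q) = E_p[ log (q/p) ] ≤ log E_p[ q/p ] = log Σ_x p(x)·(q(x)/p(x)) = log Σ_x q(x) ≤ log 1 = 0
(the sums run over x with p(x) > 0, so Σ_x q(x) ≤ 1)
⇒ KL(p ∥ q) ≥ 0, with equality ⇔ p = q (log is strictly concave, so Jensen is tight only when q/p is constant under p, and normalisation forces that constant to be 1)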

So KL behaves like a distance in one respect — zero exactly at identity, positive otherwise — and we are tempted to treat it as one. It is not one, and the failure is instructive rather than embarrassing.
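The Jensen step can also be sanity-checked numerically. This is a sketch under stated assumptions, not a proof: the random five-outcome distributions, the fixed seed, and the helper names are all illustrative choices.

```python
import math
import random

def kl(p, q):
    """KL(p || q) in nats for strictly positive discrete distributions."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q))

def random_distribution(k, rng):
    """Draw k positive weights and normalise them to sum to 1."""
    weights = [rng.random() + 1e-9 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)   # fixed seed so the check is repeatable
checks = []
for _ in range(1000):
    p = random_distribution(5, rng)
    q = random_distribution(5, rng)
    # Jensen with Z = q/p under p: E_p[log Z] <= log E_p[Z]
    lhs = sum(px * math.log(qx / px) for px, qx in zip(p, q))
    rhs = math.log(sum(px * (qx / px) for px, qx in zip(p, q)))
    checks.append((lhs, rhs, kl(p, q)))

smallest_kl = min(c[2] for c in checks)   # stays above zero for p != q
self_kl = kl(p, p)                        # zero when both arguments are identical
print(smallest_kl, self_kl)
```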

03 · The asymmetry

Not a distance — a choice of which errors to fear

KL(p ∥ q) ≠ KL(q ∥ p), and the direction you optimise decides what kind of approximation you get. Fit a single Gaussian q to a bimodal target p and the two directions give two different answers, both correct by their own lights:

One bimodal target, one Gaussian budget, two KL directions, two philosophies of approximation.

Forward · KL(p ∥ q) · zero-avoiding

Expectation under p: every region where p has mass must get q-mass, or log p/q → ∞. So q stretches to cover everything — mean-seeking, blurry between the modes. This is what MLE minimises.

Reverse · KL(q ∥ p) · zero-forcing

Expectation under q: q is only punished where it puts mass, so it retreats from p's empty valleys and commits to one mode. Sharp but partial. This is what variational inference typically minimises.
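The two behaviours can be reproduced with a brute-force grid search. The sketch below rests on several assumptions that are not in the text: the target is an equal mixture of N(−2, 0.5²) and N(2, 0.5²), everything is discretised on a grid over [−6, 6], and the candidate (mean, sigma) pairs are a coarse hypothetical grid rather than the output of a real optimiser.

```python
import math

def log_normal_grid(xs, mu, sigma):
    """Log-probabilities of N(mu, sigma^2), renormalised over the grid xs."""
    logits = [-0.5 * ((x - mu) / sigma) ** 2 for x in xs]
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return [l - log_z for l in logits]

def log_bimodal_grid(xs):
    """Log-probabilities of the assumed bimodal target, normalised on the grid."""
    dens = [math.exp(-0.5 * ((x + 2) / 0.5) ** 2)
            + math.exp(-0.5 * ((x - 2) / 0.5) ** 2) for x in xs]
    z = sum(dens)
    return [math.log(d / z) for d in dens]

def kl_from_logs(log_a, log_b):
    """KL(a || b) = sum_x a(x) * (log a(x) - log b(x))."""
    return sum(math.exp(la) * (la - lb) for la, lb in zip(log_a, log_b))

xs = [-6 + 0.1 * i for i in range(121)]
log_p = log_bimodal_grid(xs)

# Candidate Gaussians: mean in [-3, 3], sigma in [0.3, 3.0]
candidates = [(-3 + 0.25 * i, 0.3 + 0.1 * j) for i in range(25) for j in range(28)]

# Forward KL(p || q): expectation under p, so q must cover both modes
forward_mu, forward_sigma = min(
    candidates, key=lambda c: kl_from_logs(log_p, log_normal_grid(xs, *c)))

# Reverse KL(q || p): expectation under q, so q avoids the empty valley
reverse_mu, reverse_sigma = min(
    candidates, key=lambda c: kl_from_logs(log_normal_grid(xs, *c), log_p))

print("forward:", forward_mu, forward_sigma)
print("reverse:", reverse_mu, reverse_sigma)
```

Under these assumptions the forward fit should sit between the modes with a wide sigma, while the reverse fit should sit on one mode with a narrow sigma. The two modes are symmetric, so which one the reverse fit picks is an artefact of the search order.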

04 · The trinity

Cross-entropy = KL = MLE

Three objectives that sound different and are the same optimisation. With p the data distribution fixed and q_θ the model:

H(p, q_θ) = H(p) + KL(p ∥ q_θ) — H(p) is a constant in θ
⇒ argmin_θ H(p, q_θ) = argmin_θ KL(p ∥ q_θ) = argmax_θ E_p[log q_θ(x)]

And the rightmost term, estimated with samples, is the average log-likelihood of the training set, which has the same maximiser as the total log-likelihood. Minimising cross-entropy, minimising forward KL, and maximum likelihood are one procedure wearing three names, which is why classifiers inherit forward KL's mean-seeking generosity toward covering all the data.
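A small coin-flip example makes the equivalence concrete. It is a minimal sketch with assumed ingredients: the ten observations in `data`, the Bernoulli model q_θ, and the grid of θ values are all hypothetical, and p is taken to be the empirical distribution of the data.

```python
import math

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]   # hypothetical coin flips, seven heads
p_heads = sum(data) / len(data)
p = [1 - p_heads, p_heads]               # empirical data distribution

def entropy(p):
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def avg_log_likelihood(data, theta):
    """Mean log-probability of the observations under Bernoulli(theta)."""
    return sum(math.log(theta if x == 1 else 1 - theta) for x in data) / len(data)

thetas = [i / 100 for i in range(1, 100)]   # candidate parameters, endpoints excluded

best_ce = min(thetas, key=lambda t: cross_entropy(p, [1 - t, t]))
best_kl = min(thetas, key=lambda t: kl(p, [1 - t, t]))
best_ll = max(thetas, key=lambda t: avg_log_likelihood(data, t))

# The decomposition H(p, q) = H(p) + KL(p || q) at an arbitrary theta
q = [0.6, 0.4]
gap = cross_entropy(p, q) - (entropy(p) + kl(p, q))
print(best_ce, best_kl, best_ll, gap)
```

All three searches select θ = 0.7, the sample frequency of heads, and the decomposition gap is zero up to floating-point error.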

05 · Field guide

Where KL lives in modern ML

System	Term	What the KL is doing
Any classifier / LM	cross-entropy loss	Forward KL to the data distribution, via the trinity above.
VAE	KL(q(z|x) ∥ p(z))	Keeps the encoder's posterior pinned near the prior so the latent space stays usable.
RLHF / PPO	KL(π ∥ π_ref) penalty	A leash: lets the policy chase reward only as far as it stays probabilistically close to the reference model.
Distillation	KL(teacher ∥ student)	The student matches the teacher's full soft distribution, not just its argmax — the dark knowledge is in the ratios.
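Two of the rows above reduce to short formulas. The sketch below makes assumptions the table does not state: for the VAE row, a diagonal-Gaussian encoder and a standard-normal prior, for which the KL has the closed form ½ Σ (μ² + σ² − 1 − log σ²); for the distillation row, softmax outputs softened by a hypothetical temperature `T`. The logits and function names are illustrative.

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in nats, summed over latent dimensions."""
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s) for m, s in zip(mu, sigma))

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [l / temperature for l in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def kl(p, q):
    """KL(p || q) in nats for discrete distributions with q > 0 wherever p > 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# VAE row: the penalty vanishes when the encoder output equals the prior
vae_at_prior = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [1.0, 1.0])
vae_shifted = kl_diag_gaussian_to_standard_normal([1.0, 0.0], [1.0, 1.0])

# Distillation row: KL(teacher || student) on softened distributions
teacher_logits = [4.0, 1.0, 0.5]   # hypothetical values
student_logits = [3.0, 2.0, 0.0]   # hypothetical values
T = 2.0                            # assumed temperature
distill_loss = kl(softmax(teacher_logits, T), softmax(student_logits, T))

print(vae_at_prior, vae_shifted, distill_loss)
```

Shifting one latent mean by 1 costs 0.5 nats. The distillation loss is positive even though teacher and student agree on the argmax, because the KL compares the full distributions.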

Caveat that earns its own note: KL(p ∥ q) is infinite whenever p puts mass where q has none (disjoint supports are the extreme case), and it is asymmetric by construction. When you need a symmetric, bounded comparison between two arbitrary distributions, see Jensen–Shannon divergence.
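The contrast shows up on two distributions with disjoint supports. The sketch assumes the standard definition of Jensen–Shannon divergence as the average KL to the midpoint mixture m = (p + q)/2; the four-outcome example and function names are illustrative.

```python
import math

def kl_bits(p, q):
    """KL(p || q) in bits; infinite if p has mass where q has none."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue
        if qx == 0:
            return math.inf
        total += px * math.log2(px / qx)
    return total

def js_bits(p, q):
    """Jensen-Shannon divergence in bits: average KL to the midpoint mixture."""
    m = [0.5 * (px + qx) for px, qx in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

p = [0.5, 0.5, 0.0, 0.0]   # hypothetical: mass on the first two outcomes
q = [0.0, 0.0, 0.5, 0.5]   # hypothetical: mass on the last two outcomes

print(kl_bits(p, q))   # inf
print(js_bits(p, q))   # 1.0
```

The mixture m has mass wherever p or q does, so neither KL term inside `js_bits` can blow up, and the result tops out at 1 bit.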

Mental Model

KL(p ∥ q) = expected extra bits paid for compressing p's data with q's codebook.
Always ≥ 0 (two lines of Jensen); zero only at p = q.
Not symmetric, on purpose: forward KL spreads to cover p, reverse KL commits to a mode.
Cross-entropy, forward KL, and MLE are the same objective; H(p) is the constant between them.
In the wild: the VAE regulariser, the RLHF leash, the distillation target.