
KL Divergence

The price of believing q when the truth is p

01 · First principles · How wrong is a wrong distribution?

From the entropy note: a code built for q assigns each event a length of −log q(x). If the data actually comes from p, the average overspend — extra bits per symbol, beyond the irreducible H(p) — is:

KL(p ∥ q) = E_{x∼p}[ log p(x) − log q(x) ] = Σ_x p(x) log( p(x) / q(x) )
expected extra bits for using q's codebook on p's data (bits when the log is base 2, nats when it is natural)

Read the structure carefully: the log-ratio measures pointwise wrongness, and the expectation is taken under p — the truth decides which mistakes matter. Wherever p puts no mass, q can be arbitrarily wrong for free; wherever p puts mass and q puts nearly none, log p/q explodes. (That explosion is not a bug; it is the whole personality of the measure, as we will see.)
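The overspend reading can be checked numerically. The following is a minimal sketch, assuming discrete distributions given as arrays over the same outcomes; the function names and the example values of p and q are illustrative, not taken from the entropy note.

```python
import numpy as np

def entropy_bits(p):
    """H(p): the irreducible average code length, in bits."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))

def cross_entropy_bits(p, q):
    """Average code length when p's data is coded with q's codebook, in bits."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0                       # events p never produces cost nothing
    if np.any(q[nz] == 0):
        return float("inf")          # p has mass where q has none
    return float(-np.sum(p[nz] * np.log2(q[nz])))

def kl_bits(p, q):
    """KL(p || q): expected extra bits, with the expectation taken under p."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    if np.any(q[nz] == 0):
        return float("inf")
    return float(np.sum(p[nz] * np.log2(p[nz] / q[nz])))

# Illustrative values (an assumption for this sketch)
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]

print("H(p)      =", entropy_bits(p))
print("H(p, q)   =", cross_entropy_bits(p, q))
print("KL(p ‖ q) =", kl_bits(p, q))
```

For these values the arithmetic by hand gives H(p) = 1.5 bits, an average code length of 1.75 bits under q's codebook, and so an overspend of 0.25 bits per symbol. Setting any q(x) to zero where p(x) > 0 makes the function return infinity, which is the explosion described above.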

02 · The guarantee · KL ≥ 0, in two lines

log is concave, so Jensen's inequality says E[log Z] ≤ log E[Z]. Apply it to Z = q/p under p:

−KL(p ∥ q) = E_p[ log (q/p) ] ≤ log E_p[ q/p ] = log Σ_x p(x)·(q(x)/p(x)) = log Σ_{x: p(x)>0} q(x) ≤ log 1 = 0
⇒ KL(p ∥ q) ≥ 0,  with equality  ⇔  p = q

So KL behaves like a distance in one respect — zero exactly at identity, positive otherwise — and we are tempted to treat it as one. It is not one, and the failure is instructive rather than embarrassing.

03 · The asymmetry · Not a distance: a choice of which errors to fear

KL(p ∥ q) ≠ KL(q ∥ p), and the direction you optimise decides what kind of approximation you get. Fit a single Gaussian q to a bimodal target p and the two directions give two different answers, both correct by their own lights:

[Figure, two panels. Left: forward KL(p ∥ q), mean-seeking; the fit q spreads over both modes of the bimodal target p. Right: reverse KL(q ∥ p), mode-seeking; the fit q commits to one mode of the same target.]

One bimodal target, one Gaussian budget, two KL directions, two philosophies of approximation.

Forward · KL(p ∥ q) · zero-avoiding
Expectation under p: every region where p has mass must get q-mass, or log p/q → ∞. So q stretches to cover everything — mean-seeking, blurry between the modes. This is what MLE minimises.
Reverse · KL(q ∥ p) · zero-forcing
Expectation under q: q is only punished where it puts mass, so it retreats from p's empty valleys and commits to one mode. Sharp but partial. This is what variational inference typically minimises.
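The two behaviours can be reproduced with a brute-force fit. This is a rough sketch under stated assumptions: the target is a hypothetical equal mixture of N(−3, 1) and N(3, 1), integrals are approximated by a Riemann sum on a fixed grid, and the best Gaussian is found by grid search over (μ, σ) rather than by gradient descent.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 2001)   # integration grid
dx = x[1] - x[0]

def log_gaussian(x, mu, sigma):
    """Log-density of N(mu, sigma^2), kept in log space to avoid underflow."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Hypothetical bimodal target: 0.5 * N(-3, 1) + 0.5 * N(3, 1)
log_p = np.logaddexp(np.log(0.5) + log_gaussian(x, -3.0, 1.0),
                     np.log(0.5) + log_gaussian(x, 3.0, 1.0))

def kl_on_grid(log_a, log_b):
    """Riemann-sum approximation of KL(a || b) in nats."""
    return float(np.sum(np.exp(log_a) * (log_a - log_b)) * dx)

mus = np.linspace(-5.0, 5.0, 41)     # candidate means, step 0.25
sigmas = np.linspace(0.5, 5.0, 46)   # candidate standard deviations, step 0.1

def best_fit(direction):
    """Grid-search the single Gaussian q that minimises the chosen KL direction."""
    best = (np.inf, None, None)
    for mu in mus:
        for sigma in sigmas:
            log_q = log_gaussian(x, mu, sigma)
            if direction == "forward":
                kl = kl_on_grid(log_p, log_q)   # expectation under p
            else:
                kl = kl_on_grid(log_q, log_p)   # expectation under q
            if kl < best[0]:
                best = (kl, float(mu), float(sigma))
    return best

fwd = best_fit("forward")
rev = best_fit("reverse")
print("forward KL fit: mu=%.2f sigma=%.2f" % (fwd[1], fwd[2]))
print("reverse KL fit: mu=%.2f sigma=%.2f" % (rev[1], rev[2]))
```

Under these assumptions the forward fit should land near the moment-matched Gaussian (mean 0, standard deviation close to √10 ≈ 3.16), while the reverse fit should sit on one of the two modes with a standard deviation near 1. Which mode the reverse fit picks is an artefact of the search order, since the target is symmetric.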

04 · The trinity · Cross-entropy = KL = MLE

Three objectives that sound different and are the same optimisation. With the data distribution p fixed and q_θ the model:

H(p, q_θ) = H(p) + KL(p ∥ q_θ)   (H(p) is constant in θ)
⇒ argmin_θ H(p, q_θ) = argmin_θ KL(p ∥ q_θ) = argmax_θ E_p[ log q_θ(x) ]

The rightmost term, estimated from samples, is the log-likelihood of the training set divided by its size N, so maximising it is exactly maximum likelihood. Minimising cross-entropy, minimising forward KL, and maximum likelihood are one procedure wearing three names, which is why classifiers inherit forward KL's mean-seeking generosity toward covering all the data.
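A small numerical check of the equivalence, as a sketch only: the ten-example binary dataset is hypothetical, the model is a single Bernoulli parameter θ, and the optimum is found by scanning a grid of θ values rather than by any real training loop.

```python
import numpy as np

data = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])   # hypothetical training set
p1 = data.mean()                                   # empirical p(x = 1), here 0.3
thetas = np.linspace(0.01, 0.99, 99)               # candidate model values for q(x = 1)

# H(p): does not depend on theta
entropy = -(p1 * np.log(p1) + (1 - p1) * np.log(1 - p1))

# H(p, q_theta): cross-entropy of each candidate model against the data distribution
cross_entropy = -(p1 * np.log(thetas) + (1 - p1) * np.log(1 - thetas))

# KL(p || q_theta): forward KL from the data distribution to each candidate
kl = p1 * np.log(p1 / thetas) + (1 - p1) * np.log((1 - p1) / (1 - thetas))

# Average log-likelihood of the samples under each candidate
avg_loglik = np.array([
    np.mean(data * np.log(t) + (1 - data) * np.log(1 - t)) for t in thetas
])

theta_ce = thetas[np.argmin(cross_entropy)]
theta_kl = thetas[np.argmin(kl)]
theta_mle = thetas[np.argmax(avg_loglik)]
print(theta_ce, theta_kl, theta_mle)
```

All three searches should select the same θ, the empirical frequency of ones, and `cross_entropy` should equal `entropy + kl` at every grid point up to floating-point error.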

05 · Field guide · Where KL lives in modern ML

System · Term · What the KL is doing
Any classifier / LM · cross-entropy loss · Forward KL to the data distribution, via the trinity above.
VAE · KL(q(z|x) ∥ p(z)) · Keeps the encoder's posterior pinned near the prior so the latent space stays usable.
RLHF / PPO · KL(π ∥ π_ref) penalty · A leash: lets the policy chase reward only as far as it stays probabilistically close to the reference model.
Distillation · KL(teacher ∥ student) · The student matches the teacher's full soft distribution, not just its argmax; the dark knowledge is in the ratios.
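To make the distillation entry concrete, here is a minimal NumPy sketch of a KL(teacher ∥ student) loss computed from logits. The temperature parameter, the function names, and the example logits are assumptions added for illustration; real training code would normally use a framework's built-in loss.

```python
import numpy as np

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    """Mean KL(teacher || student) over the batch, in nats."""
    log_t = log_softmax(teacher_logits, temperature)
    log_s = log_softmax(student_logits, temperature)
    per_example = np.sum(np.exp(log_t) * (log_t - log_s), axis=-1)
    return float(np.mean(per_example))

# Hypothetical logits for one example with three classes
teacher = np.array([[4.0, 2.0, 0.0]])
student_same = teacher + 5.0                 # same distribution, shifted logits
student_swapped = np.array([[4.0, 0.0, 2.0]])  # same argmax, different ratios

print(distillation_kl(teacher, student_same))
print(distillation_kl(teacher, student_swapped))
```

The second student agrees with the teacher on the top class yet still pays a positive KL, because it ranks the two runner-up classes differently. That penalty on the ratios is what a hard-label loss would miss.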
Caveat that earns its own note: KL(p ∥ q) is infinite whenever p puts mass where q has none (disjoint supports are the extreme case), and it is asymmetric by construction. When you need a symmetric, bounded comparison between two arbitrary distributions, see Jensen–Shannon divergence.
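A quick illustration of the caveat, as a sketch with made-up distributions: on disjoint supports KL is infinite in both directions, while the Jensen–Shannon divergence, defined through the midpoint mixture m = (p + q)/2, stays finite and symmetric.

```python
import numpy as np

def kl_bits(p, q):
    """KL(p || q) in bits; infinite if p has mass where q has none."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    nz = p > 0
    if np.any(q[nz] == 0):
        return float("inf")
    return float(np.sum(p[nz] * np.log2(p[nz] / q[nz])))

def js_bits(p, q):
    """Jensen-Shannon divergence in bits: average KL to the midpoint mixture."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

# Two distributions with disjoint supports (illustrative values)
p = [0.5, 0.5, 0.0, 0.0]
q = [0.0, 0.0, 0.5, 0.5]

print(kl_bits(p, q), kl_bits(q, p))   # both infinite
print(js_bits(p, q), js_bits(q, p))   # finite and equal
```

With base-2 logs the Jensen–Shannon divergence is bounded by 1 bit, and this disjoint pair attains that bound.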
Mental Model