Jensen–Shannon Divergence

KL, symmetrised through a mixture — and why GAN gradients die

01 · The problem: KL is a poor referee between strangers

KL divergence is built for an asymmetric job: a true distribution p judging a model q. Ask it instead to compare two arbitrary distributions on equal footing — two generators, two corpora, real versus fake — and two flaws surface immediately:

  1. Asymmetry. KL(p ∥ q) ≠ KL(q ∥ p), so "how far apart are p and q" has no single answer; the verdict depends on who is judging whom.
  2. Infinities. If p puts mass anywhere q puts none, KL(p ∥ q) = ∞. Two distributions with disjoint support are "infinitely far apart" no matter whether they are adjacent or in different galaxies — the measure saturates and stops discriminating.
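
Both flaws can be reproduced in a few lines. The following is a minimal sketch, assuming discrete distributions stored as NumPy arrays over a shared set of outcomes; the helper name `kl` and all toy numbers are illustrative assumptions, not part of the text above.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats for discrete distributions on a shared outcome set."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0                  # outcomes with p(x) = 0 contribute nothing
    if np.any(q[support] == 0):      # p has mass where q has none
        return float("inf")
    return float(np.sum(p[support] * np.log(p[support] / q[support])))

# Flaw 1: asymmetry (toy numbers, chosen only for illustration).
p = np.array([0.8, 0.1, 0.1])
q = np.array([0.4, 0.4, 0.2])
kl_pq = kl(p, q)
kl_qp = kl(q, p)
print(f"KL(p||q) = {kl_pq:.4f}, KL(q||p) = {kl_qp:.4f}")

# Flaw 2: infinities. `near` sits right next to `a`, `far` is at the other
# end of the grid, yet both disjoint cases produce the same answer.
a    = np.array([0.5, 0.5, 0, 0, 0, 0, 0, 0, 0, 0])
near = np.array([0, 0, 0.5, 0.5, 0, 0, 0, 0, 0, 0])
far  = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0.5, 0.5])
kl_near = kl(a, near)
kl_far = kl(a, far)
print(kl_near, kl_far)
```

The two directions disagree for `p` and `q`, and the adjacent and distant disjoint pairs are indistinguishable: both come back as infinity.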

For early generative models this is not a corner case; it is the default. A fresh generator and the real data typically occupy disjoint slivers of a high-dimensional space.

02 · The fix: Compare both sides to the midpoint

Neither p nor q deserves to be the reference, so make a neutral one — the average distribution m — and let each side pay KL against it:

m = (p + q)/2      JSD(p, q) = ½ KL(p ∥ m) + ½ KL(q ∥ m)

The mixture fixes both flaws at once. Symmetry is immediate from the formula. And the infinity is gone: wherever p(x) > 0 we have m(x) ≥ p(x)/2 > 0, so the log-ratio inside each KL is at most log 2 — the reference can never be empty where either side has mass. Hence:

0 ≤ JSD(p, q) ≤ log 2   (equality at log 2 ⇔ disjoint supports; in bits, max = 1)

Bonus structure: √JSD satisfies the triangle inequality, so it is a genuine metric on distributions, something neither KL direction can claim. There is also an information reading: JSD(p, q) is the mutual information between a sample and a fair coin saying which distribution it came from. Indistinguishable sources give 0; perfectly separable sources give 1 bit.
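
The definition, the bound, and the mutual-information reading can all be checked numerically. This is a rough sketch under the assumption of discrete distributions as NumPy arrays; the function names (`jsd`, `entropy`, `coin_mutual_information`) and the example vectors are made up for illustration. It uses the identity I(X; Z) = H(m) − ½H(p) − ½H(q) for a fair coin Z.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence in nats, via the mixture m = (p + q) / 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def entropy(p):
    """Shannon entropy in nats."""
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(p[mask])))

def coin_mutual_information(p, q):
    """I(X; Z): Z is a fair coin, X ~ p when Z = 0 and X ~ q when Z = 1."""
    m = 0.5 * (p + q)
    return entropy(m) - 0.5 * (entropy(p) + entropy(q))

# Toy distributions (illustrative values only).
p = np.array([0.8, 0.1, 0.1, 0.0])
q = np.array([0.4, 0.4, 0.2, 0.0])
r = np.array([0.1, 0.1, 0.1, 0.7])
a = np.array([0.5, 0.5, 0.0, 0.0])   # disjoint from b
b = np.array([0.0, 0.0, 0.5, 0.5])

jsd_pq, jsd_qp = jsd(p, q), jsd(q, p)
jsd_disjoint = jsd(a, b)
mi_pq = coin_mutual_information(p, q)

print("symmetric:", np.isclose(jsd_pq, jsd_qp))
print("disjoint hits log 2:", np.isclose(jsd_disjoint, np.log(2)))
print("in bits:", jsd_disjoint / np.log(2))
print("equals coin mutual information:", np.isclose(jsd_pq, mi_pq))
print("sqrt triangle inequality:",
      np.sqrt(jsd(p, r)) <= np.sqrt(jsd(p, q)) + np.sqrt(jsd(q, r)))
```

Dividing by log 2 converts nats to bits, which is where the 1-bit ceiling comes from.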

03 · Visualise it: Bounded, symmetric, and flat where it matters

[Figure, two panels. Top (p, q, and mixture m overlap): JSD informative, gradient exists. Bottom (p and q disjoint): JSD = log 2, flat, gradient = 0; moving q closer changes nothing.]

Top: overlapping supports — JSD varies smoothly as q moves. Bottom: disjoint supports — JSD is pinned at log 2 regardless of the gap.

The bottom panel shows the catch baked into the bound. Once supports are disjoint, JSD reports log 2 whatever the distance between them. Bounded means saturating, and saturating means the derivative with respect to "move q toward p" is zero. Remember this panel; it reappears as a training pathology in the next section.
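
The saturation is easy to see numerically. The sketch below assumes a toy setup that is not in the text above: p and q are uniform "boxes" on a 100-bin grid, and q is slid away from p. The helper names `box` and `jsd` are hypothetical.

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence in nats for discrete distributions."""
    m = 0.5 * (p + q)
    def kl_to_m(a):
        mask = a > 0                 # m > 0 wherever a > 0, so this is safe
        return np.sum(a[mask] * np.log(a[mask] / m[mask]))
    return float(0.5 * kl_to_m(p) + 0.5 * kl_to_m(q))

def box(n_bins, start, width):
    """Uniform distribution over bins [start, start + width)."""
    d = np.zeros(n_bins)
    d[start:start + width] = 1.0 / width
    return d

n_bins, width = 100, 10
p = box(n_bins, 0, width)

# Shifts below `width` overlap with p; shifts of `width` or more are disjoint.
shifts = np.array([0, 2, 5, 8, 10, 20, 50, 90])
values = np.array([jsd(p, box(n_bins, s, width)) for s in shifts])

# Finite-difference slope of JSD with respect to the shift of q.
slopes = np.diff(values) / np.diff(shifts)

for s, v in zip(shifts, values):
    print(f"shift = {s:2d}   JSD = {v:.4f}   (log 2 = {np.log(2):.4f})")
print("slopes:", np.round(slopes, 4))
```

While the boxes overlap, the slope is positive, so moving q toward p lowers JSD. Once they separate, every further shift returns exactly log 2 and the slope is zero.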

04 · The GAN connection: Where JSD secretly ran the show

The original GAN objective, with an optimal discriminator D* plugged in, reduces to:

min_G max_D V(D, G)  ⟹  min_G [ 2·JSD(p_data, p_G) − log 4 ]

So a GAN generator is, in the idealised limit, performing gradient descent on JSD. Elegant, and also the source of the field's most famous failure. Early in training, p_G and p_data have essentially disjoint supports (both are thin manifolds in pixel space), which is precisely the flat regime above: JSD sits at log 2, a confident discriminator saturates, and the generator's gradient vanishes. The theory's tidiest property and the practice's worst instability are the same fact.
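
The reduction can be sanity-checked on a toy discrete example. The sketch assumes the standard closed form of the optimal discriminator, D*(x) = p_data(x) / (p_data(x) + p_G(x)), which the text above refers to as D* without spelling out, and the usual value function V(D, G) = E_data[log D(x)] + E_G[log(1 − D(x))]. The three-outcome distributions and the names `value`, `d_star`, and `d_coin` are invented for illustration.

```python
import numpy as np

def jsd(p, q):
    """JSD in nats; both inputs are strictly positive in this example."""
    m = 0.5 * (p + q)
    return float(0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m)))

def value(d, p_data, p_g):
    """V(D, G) = E_data[log D(x)] + E_G[log(1 - D(x))] for discrete x."""
    return float(np.sum(p_data * np.log(d)) + np.sum(p_g * np.log(1.0 - d)))

# Toy "data" and "generator" distributions over three outcomes.
p_data = np.array([0.5, 0.3, 0.2])
p_g = np.array([0.2, 0.2, 0.6])

# Optimal discriminator for a fixed generator (assumed closed form).
d_star = p_data / (p_data + p_g)
v_star = value(d_star, p_data, p_g)
identity_rhs = 2.0 * jsd(p_data, p_g) - np.log(4.0)

# A discriminator that always answers 0.5 scores exactly -log 4,
# which is the floor reached when p_G matches p_data.
d_coin = np.full(3, 0.5)
v_coin = value(d_coin, p_data, p_g)

print(f"V(D*, G)       = {v_star:.6f}")
print(f"2*JSD - log 4  = {identity_rhs:.6f}")
print(f"V(coin-flip D) = {v_coin:.6f}")
```

The first two printed numbers agree, and the coin-flip discriminator scores lower than D*, consistent with D* being the maximiser of V for this generator.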

The one-line sequel: Wasserstein distance measures the cost of transporting mass from q to p, so it keeps growing — and keeps a gradient — even between disjoint distributions; WGAN swaps the divergence to buy exactly that slope.
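
To make the contrast concrete, here is a small sketch comparing the two quantities on disjoint 1-D histograms. It assumes an evenly spaced grid and uses the fact that, in one dimension, the Wasserstein-1 distance equals the area between the two CDFs. The helpers `wasserstein_1d`, `jsd`, and `box` are illustrative names, not from the text above, and this is not how WGAN estimates the distance in practice.

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence in nats for discrete distributions."""
    m = 0.5 * (p + q)
    def kl_to_m(a):
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / m[mask]))
    return float(0.5 * kl_to_m(p) + 0.5 * kl_to_m(q))

def wasserstein_1d(p, q, bin_width=1.0):
    """W1 between histograms on the same evenly spaced 1-D grid.

    In 1-D, W1 is the integral of |CDF_p - CDF_q|.
    """
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * bin_width)

def box(n_bins, start, width):
    """Uniform distribution over bins [start, start + width)."""
    d = np.zeros(n_bins)
    d[start:start + width] = 1.0 / width
    return d

n_bins, width = 100, 10
p = box(n_bins, 0, width)

# Every shift here leaves p and q with disjoint supports.
shifts = [10, 20, 50, 90]
jsd_values = [jsd(p, box(n_bins, s, width)) for s in shifts]
w1_values = [wasserstein_1d(p, box(n_bins, s, width)) for s in shifts]

for s, j, w in zip(shifts, jsd_values, w1_values):
    print(f"shift = {s:2d}   JSD = {j:.4f}   W1 = {w:.1f} bins")
```

JSD reports log 2 in every row, while W1 tracks the shift itself, so a step that moves q closer to p always reduces it.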

05 · Choosing: KL or JSD?

| Property | KL(p ∥ q) | JSD(p, q) |
| --- | --- | --- |
| Symmetric | No | Yes |
| Bounded | No (∞ on disjoint support) | Yes, within [0, log 2] |
| Metric (after √) | No | Yes |
| Gradient between disjoint supports | Undefined / ∞ | Zero (saturated) |
| Natural role | Truth judging a model (MLE, VAEs, RLHF) | Two peers compared fairly (GAN theory, corpus drift, embedding-distribution shift) |

Note the shared weakness in the fourth row: JSD fixes KL's infinity but not the underlying blindness between non-overlapping distributions. When that regime is the one you care about, neither divergence is the right tool.

Mental Model