Jensen–Shannon Divergence

KL, symmetrised through a mixture — and why GAN gradients die

01 · The problem: KL is a poor referee between strangers

KL divergence is built for an asymmetric job: a true distribution p judging a model q. Ask it instead to compare two arbitrary distributions on equal footing — two generators, two corpora, real versus fake — and two flaws surface immediately:

Asymmetry. KL(p ∥ q) ≠ KL(q ∥ p), so "how far apart are p and q" has no single answer; the verdict depends on who is judging whom.
Infinities. If p puts mass anywhere q puts none, KL(p ∥ q) = ∞. Two distributions with disjoint support are "infinitely far apart" no matter whether they are adjacent or in different galaxies — the measure saturates and stops discriminating.
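Both flaws are easy to reproduce on small discrete distributions. The following is a minimal sketch, not a reference implementation: the helper `kl` and the toy distributions are illustrative assumptions, and it uses natural logarithms (nats).

```python
import math

def kl(p, q):
    """KL(p || q) in nats for discrete distributions given as equal-length lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue            # convention: 0 * log(0 / q) = 0
        if qi == 0.0:
            return math.inf     # p has mass where q has none
        total += pi * math.log(pi / qi)
    return total

# Asymmetry: swapping the roles of p and q changes the answer.
p = [0.8, 0.1, 0.1]
q = [0.4, 0.3, 0.3]
print(kl(p, q), kl(q, p))

# Infinities: with disjoint supports, the size of the gap is invisible.
near = kl([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0])
far = kl([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0])
print(near, far)  # inf inf
```

The last two calls differ only in how far apart the occupied bins sit, yet KL reports the same value for both.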

For a generative model early in training this is not a corner case; it is the default. A freshly initialised generator and the real data typically occupy disjoint slivers of a high-dimensional space.

02 · The fix: Compare both sides to the midpoint

Neither p nor q deserves to be the reference, so make a neutral one — the average distribution m — and let each side pay KL against it:

m = (p + q)/2
JSD(p, q) = ½ KL(p ∥ m) + ½ KL(q ∥ m)

The mixture fixes both flaws at once. Symmetry is immediate from the formula. And the infinity is gone: wherever p(x) > 0 we have m(x) ≥ p(x)/2 > 0, so the log-ratio inside each KL is at most log 2 — the reference can never be empty where either side has mass. Hence:

0 ≤ JSD(p, q) ≤ log 2, with JSD = 0 ⇔ p = q and JSD = log 2 ⇔ disjoint supports; measured in bits, the maximum is 1
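The definition and its bounds can be checked numerically. This is a small sketch under the same assumptions as the text (discrete distributions, natural logs); the function names `kl` and `jsd` and the example vectors are illustrative choices.

```python
import math

def kl(p, q):
    """KL(p || q) in nats; returns inf if p has mass where q has none."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

def jsd(p, q):
    """Jensen-Shannon divergence: average KL of each side against the mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    # m is positive wherever p or q is, so neither KL term can be infinite.
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Partially overlapping supports: KL(p || q) would be infinite here.
p = [0.8, 0.1, 0.1, 0.0]
q = [0.0, 0.3, 0.3, 0.4]
print(jsd(p, q), jsd(q, p))       # equal: symmetric
print(jsd(p, p))                  # zero for identical distributions

# Disjoint supports hit the upper bound log 2 (1 bit).
disjoint = jsd([1.0, 0.0], [0.0, 1.0])
print(disjoint, math.log(2), disjoint / math.log(2))
```

Dividing a value in nats by `math.log(2)` converts it to bits, which is where the "max = 1" reading comes from.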

Bonus structure: √JSD satisfies the triangle inequality, so it is a genuine metric on distributions — something neither KL direction can claim. There is also an information reading: JSD(p, q) is the mutual information between a sample and a fair coin saying which distribution it came from. Indistinguishable sources give 0; perfectly separable sources give 1 bit.
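Both claims can be spot-checked on a three-point example. The sketch below works in bits and computes the mutual information as H(m) − ½H(p) − ½H(q), which is the standard identity for a fair coin choosing between p and q. All function names and the distributions p, q, r are illustrative assumptions, and a single example illustrates the triangle inequality without proving it.

```python
import math

def kl_bits(a, b):
    """KL(a || b) in bits; assumes b > 0 wherever a > 0 (true when b is the mixture)."""
    return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def jsd_bits(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

def entropy_bits(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def coin_mutual_information_bits(p, q):
    """I(X; Z) where Z is a fair coin and X is drawn from p on heads, q on tails."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return entropy_bits(m) - 0.5 * entropy_bits(p) - 0.5 * entropy_bits(q)

def js_distance(p, q):
    """Square root of JSD, the quantity that behaves as a metric."""
    return math.sqrt(jsd_bits(p, q))

p = [1.0, 0.0]
q = [0.5, 0.5]   # sits "between" p and r
r = [0.0, 1.0]

# Information reading: the two computations agree.
print(jsd_bits(p, q), coin_mutual_information_bits(p, q))
print(jsd_bits(p, r))  # perfectly separable sources: 1 bit

# Triangle inequality: fails for raw JSD on this example, holds after the square root.
print(jsd_bits(p, r) <= jsd_bits(p, q) + jsd_bits(q, r))
print(js_distance(p, r) <= js_distance(p, q) + js_distance(q, r))
```

The example is chosen so that the detour through q is cheap in raw JSD, which is exactly the situation where the square root is needed.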

03 · Visualise it: Bounded, symmetric, and flat where it matters

Figure (two panels). Top: overlapping supports, where JSD varies smoothly as q moves. Bottom: disjoint supports, where JSD is pinned at log 2 regardless of the gap.

The bottom panel shows the catch baked into the bound. Once supports are disjoint, JSD reports log 2 whatever the distance between them. Bounded means saturating, and saturating means the derivative with respect to "move q toward p" is zero. Remember this panel; it reappears as a training pathology in the next section.
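The saturation can be reproduced without a plot. The sketch below slides a uniform block of mass across a 1-D grid; the grid size, block width, and helper names (`block`, `jsd`) are hypothetical choices made for illustration.

```python
import math

def jsd(p, q):
    """JSD in nats for discrete distributions on the same grid."""
    total = 0.0
    for pi, qi in zip(p, q):
        mi = (pi + qi) / 2
        if pi > 0:
            total += 0.5 * pi * math.log(pi / mi)
        if qi > 0:
            total += 0.5 * qi * math.log(qi / mi)
    return total

def block(n, start, width):
    """Uniform distribution over bins start .. start + width - 1 on an n-bin grid."""
    return [1.0 / width if start <= i < start + width else 0.0 for i in range(n)]

n, width = 40, 8
p = block(n, 0, width)

# Shifts below `width` overlap with p; shifts of `width` or more are disjoint.
for shift in (0, 2, 4, 6, 8, 16, 30):
    q = block(n, shift, width)
    print(shift, round(jsd(p, q), 4))

print("log 2 =", round(math.log(2), 4))
```

While the blocks overlap, JSD rises with the shift. From the first disjoint shift onward it prints the same value, log 2, so a finite-difference "slope" between shifts 8, 16, and 30 is zero.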

04 · The GAN connection: Where JSD secretly ran the show

The original GAN objective, with an optimal discriminator D* plugged in, reduces to:

min_G max_D V(D, G) ⟹ min_G [ 2·JSD(p_data, p_G) − log 4 ]

So a GAN generator is, in the idealised limit, performing gradient descent on JSD. Elegant — and the source of the field's most famous failure. Early in training, p_G and p_data have essentially disjoint supports (both are thin manifolds in pixel space), which is precisely the flat regime above: JSD sits at log 2, a confident discriminator saturates, and the generator's gradient vanishes. The theory's tidiest property and the practice's worst instability are the same fact.
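The reduction can be verified numerically on discrete distributions. The sketch assumes the standard value function V(D, G) = E_data[log D(x)] + E_G[log(1 − D(x))] and the pointwise optimal discriminator D*(x) = p_data(x) / (p_data(x) + p_G(x)); the function names and example distributions are illustrative.

```python
import math

def jsd(p, q):
    """JSD in nats for discrete distributions."""
    total = 0.0
    for pi, qi in zip(p, q):
        mi = (pi + qi) / 2
        if pi > 0:
            total += 0.5 * pi * math.log(pi / mi)
        if qi > 0:
            total += 0.5 * qi * math.log(qi / mi)
    return total

def value_at_optimal_discriminator(p_data, p_g):
    """V(D*, G) with D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    total = 0.0
    for pd, pg in zip(p_data, p_g):
        if pd + pg == 0.0:
            continue                      # no mass from either side at this x
        d_star = pd / (pd + pg)
        if pd > 0:
            total += pd * math.log(d_star)
        if pg > 0:
            total += pg * math.log(1.0 - d_star)
    return total

pairs = [
    ([0.5, 0.3, 0.2, 0.0], [0.1, 0.2, 0.3, 0.4]),   # partial overlap
    ([0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]),  # identical
    ([0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]),   # disjoint
]
for p_data, p_g in pairs:
    lhs = value_at_optimal_discriminator(p_data, p_g)
    rhs = 2 * jsd(p_data, p_g) - math.log(4)
    print(round(lhs, 6), round(rhs, 6))
```

In the disjoint case D* is exactly 1 on real data and 0 on generated data, so V reaches its ceiling of 0 = 2·log 2 − log 4; this is the saturated regime described above.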

The one-line sequel: Wasserstein distance measures the cost of transporting mass from q to p, so it keeps growing — and keeps a gradient — even between disjoint distributions; WGAN swaps the divergence to buy exactly that slope.
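The contrast is visible on a 1-D grid, where the Wasserstein-1 distance has a closed form: the summed absolute difference between the two CDFs (times the bin spacing, taken as 1 here). This is a hedged sketch of that special case only, not of how WGAN estimates the distance in practice; `wasserstein_1d`, `block`, and the grid settings are illustrative assumptions.

```python
import math

def jsd(p, q):
    """JSD in nats for discrete distributions on the same grid."""
    total = 0.0
    for pi, qi in zip(p, q):
        mi = (pi + qi) / 2
        if pi > 0:
            total += 0.5 * pi * math.log(pi / mi)
        if qi > 0:
            total += 0.5 * qi * math.log(qi / mi)
    return total

def wasserstein_1d(p, q):
    """W1 between distributions on bins 0, 1, ..., n-1 with unit spacing."""
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi        # running difference of the two CDFs
        total += abs(cdf_gap)
    return total

def block(n, start, width):
    """Uniform distribution over bins start .. start + width - 1."""
    return [1.0 / width if start <= i < start + width else 0.0 for i in range(n)]

n, width = 40, 8
p = block(n, 0, width)

# All of these shifts give disjoint supports.
for shift in (8, 16, 24, 30):
    q = block(n, shift, width)
    print(shift, round(jsd(p, q), 4), round(wasserstein_1d(p, q), 4))
```

For every shift in the loop the JSD column stays at log 2 while the W1 column equals the shift itself, so moving q one bin closer to p always lowers W1.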

05 · Choosing: KL or JSD?

Property	KL(p ∥ q)	JSD(p, q)
Symmetric	No	Yes
Bounded	No (∞ on disjoint support)	Yes — [0, log 2]
Metric (after √)	No	Yes
Gradient between disjoint supports	Undefined / ∞	Zero (saturated)
Natural role	Truth judging a model (MLE, VAEs, RLHF)	Two peers compared fairly (GAN theory, corpus drift, embedding-distribution shift)

Note the shared weakness in the fourth row: JSD fixes KL's infinity but not the underlying blindness between non-overlapping distributions. When that regime is the one you care about, neither divergence is the right tool.

Mental Model

JSD = average the two KLs against the midpoint mixture m = (p+q)/2.
Symmetric, bounded by log 2, never infinite — m always has mass where either side does.
√JSD is a true metric; JSD itself is the distinguishability of one sample (in bits).
The original GAN minimises JSD via its optimal discriminator.
Bounded means saturating: disjoint supports pin JSD at log 2 and kill the gradient — Wasserstein keeps slope by pricing transport instead.