Generative Modelling

VAEs and the ELBO

An intractable integral, and the bound that makes it trainable

01 · First principles: The integral nobody can do

A latent-variable model is the natural way to say "data has hidden causes": draw z from a simple prior p(z), then decode it into an observation through p_θ(x|z). The model's density on data is the marginal

p_θ(x) = ∫ p_θ(x|z) p(z) dz

We would like to maximise log p_θ(x). But the integral runs over every possible z, and for a neural decoder it has no closed form. Naive Monte Carlo — sample z from the prior, average p_θ(x|z) — fails in a precise way: for any particular x, almost every prior sample is a z that decodes to something else entirely, contributing essentially zero. The estimate is dominated by a vanishingly rare set of "good" z values you almost never hit.
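The failure is easy to reproduce in a toy setting. The sketch below assumes a model that is not in the text above: z ~ N(0, I) in 20 dimensions with a decoder p(x|z) = N(z, σ²I), chosen only because its marginal is known in closed form. The function names and the numbers d, σ and the sample count are illustrative choices.

```python
import numpy as np

def exact_log_marginal(x, sigma):
    """log p(x) for the toy model z ~ N(0, I), x|z ~ N(z, sigma^2 I),
    whose marginal is x ~ N(0, (1 + sigma^2) I)."""
    d = x.shape[0]
    var = 1.0 + sigma**2
    return -0.5 * d * np.log(2 * np.pi * var) - np.sum(x**2) / (2 * var)

def naive_mc_log_marginal(x, sigma, n_samples, rng):
    """Estimate log p(x) by sampling z from the prior and averaging p(x|z).
    Also returns the effective sample size of the likelihood weights."""
    d = x.shape[0]
    z = rng.standard_normal((n_samples, d))            # z ~ p(z)
    log_lik = (-0.5 * d * np.log(2 * np.pi * sigma**2)
               - np.sum((x - z) ** 2, axis=1) / (2 * sigma**2))
    shift = log_lik.max()                              # log-mean-exp for stability
    w = np.exp(log_lik - shift)
    log_est = shift + np.log(w.mean())
    ess = w.sum() ** 2 / np.sum(w**2)                  # how many samples really count
    return log_est, ess

rng = np.random.default_rng(0)
d, sigma = 20, 0.1
x = rng.standard_normal(d) * np.sqrt(1.0 + sigma**2)   # a draw from the true marginal
log_est, ess = naive_mc_log_marginal(x, sigma, 10_000, rng)
exact = exact_log_marginal(x, sigma)
print(f"naive estimate: {log_est:.1f}   exact: {exact:.1f}   effective samples: {ess:.2f}")
```

In this setup the effective sample size stays close to one: a single prior sample carries almost all the weight, and the estimate lands far below the exact value.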

The forcing question: we cannot search all of z-space for the few latents that explain x. What if a second network told us where to look?

02 · The derivation: Importance weights, then Jensen

Introduce an encoder q_φ(z|x): a network that proposes plausible latents for a given x. Multiply and divide by it (importance weighting), then push the log inside the expectation with Jensen's inequality:

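log p_θ(x) = log E_{z~q_φ(z|x)}[ p_θ(x|z) p(z) / q_φ(z|x) ] ≥ E_{z~q_φ(z|x)}[ log p_θ(x|z) ] − KL(q_φ(z|x) ‖ p(z))

The right-hand side is the evidence lower bound (ELBO). Its first term is the reconstruction term; its second, the KL term, says "stay near the prior".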

Read it as a contract. The reconstruction term says: latents proposed for x must decode back to x. The KL term says: those latents must not stray far from the prior, because the prior is all we will have at generation time. Both expectations are over q, which we can sample cheaply — the intractable search has become a regression-like training objective.
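As a rough sketch of how the two terms are computed in practice, the code below estimates the reconstruction term by sampling from a diagonal Gaussian q and uses the closed-form KL to a standard normal prior. The linear decoder `W`, the fixed observation noise `sigma_x`, and the function names are assumptions made for illustration; a real VAE would use neural networks for both maps.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def elbo_estimate(x, mu, log_var, decode, sigma_x, rng, n_samples=16):
    """Monte Carlo ELBO: average log p(x|z) over z ~ q, minus KL(q || prior).
    Assumes a Gaussian decoder p(x|z) = N(decode(z), sigma_x^2 I)."""
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    z = mu + np.exp(0.5 * log_var) * eps               # samples from q(z|x)
    x_hat = decode(z)                                  # shape (n_samples, d_x)
    d_x = x.shape[0]
    log_lik = (-0.5 * d_x * np.log(2 * np.pi * sigma_x**2)
               - np.sum((x - x_hat) ** 2, axis=1) / (2 * sigma_x**2))
    return log_lik.mean() - kl_to_standard_normal(mu, log_var)

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 2))                        # hypothetical linear decoder weights
decode = lambda z: z @ W.T
x = rng.standard_normal(5)
mu, log_var = np.zeros(2), np.zeros(2)                 # stand-in for the encoder's output
value = elbo_estimate(x, mu, log_var, decode, sigma_x=1.0, rng=rng)
print(value)
```

Only the reconstruction term needs sampling here; for a Gaussian encoder and a standard normal prior the KL term is exact.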

How loose is the bound? One line of algebra (write the joint as posterior times marginal) gives the exact gap:

log p_θ(x) = ELBO + KL(q_φ(z|x) ‖ p_θ(z|x))

The slack is exactly the encoder's distance from the true posterior. Maximising the ELBO therefore does two jobs at once: over φ it tightens the bound (better inference), and over θ it pushes up a lower bound on the likelihood itself. The bound is tight precisely when q matches the true posterior.
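The identity can be checked numerically in a model where every quantity has a closed form. The sketch below assumes a one-dimensional toy model, z ~ N(0, 1) and x|z ~ N(z, s2), with a Gaussian q = N(m, v); the model and the function names are illustrative, not something defined above.

```python
import numpy as np

def log_marginal(x, s2):
    # z ~ N(0, 1), x|z ~ N(z, s2)  implies  x ~ N(0, 1 + s2)
    return -0.5 * np.log(2 * np.pi * (1 + s2)) - x**2 / (2 * (1 + s2))

def elbo(x, s2, m, v):
    # q(z|x) = N(m, v); both ELBO terms are available in closed form
    recon = -0.5 * np.log(2 * np.pi * s2) - ((x - m) ** 2 + v) / (2 * s2)
    kl_prior = 0.5 * (v + m**2 - 1.0 - np.log(v))
    return recon - kl_prior

def kl_to_posterior(x, s2, m, v):
    # true posterior is N(x / (1 + s2), s2 / (1 + s2))
    mp, vp = x / (1 + s2), s2 / (1 + s2)
    return 0.5 * (np.log(vp / v) + (v + (m - mp) ** 2) / vp - 1.0)

x, s2 = 1.3, 0.5
for m, v in [(0.0, 1.0), (0.5, 0.2), (x / (1 + s2), s2 / (1 + s2))]:
    gap = log_marginal(x, s2) - elbo(x, s2, m, v)
    print(f"m={m:.3f} v={v:.3f}  gap={gap:.6f}  KL(q||posterior)={kl_to_posterior(x, s2, m, v):.6f}")
```

The last pair sets q equal to the true posterior, where the gap vanishes and the ELBO equals log p(x).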

03 · Visualize it: The ELBO gap

For fixed θ, log p(x) is a ceiling. The ELBO sits below it by exactly KL(q ‖ true posterior); improving the encoder closes the gap from beneath.

04 · The trick: Reparameterisation, moving the randomness

One obstacle remains. The reconstruction term is an expectation over q_φ, and φ sits inside the sampling distribution. You cannot backpropagate through the act of sampling — the gradient ∇_φ E_{z~q_φ}[·] is not the expectation of a gradient.

The fix is to relocate the randomness. For a Gaussian encoder q_φ(z|x) = N(μ_φ(x), σ_φ(x)²), write the sample as a deterministic function of parameter-free noise:

z = μ_φ(x) + σ_φ(x) ⊙ ε, ε ~ N(0, I) ⇒ ∇_φ E_ε[f(z)] = E_ε[∇_φ f(z)]

Now ε carries all the randomness and φ only shapes a deterministic path, so ordinary backpropagation flows from the decoder's loss, through z, into the encoder. (This is the implementation detail that made VAEs practical; the score-function (REINFORCE) estimator also works without it but typically has much higher variance.)
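The variance difference can be seen on a one-dimensional example. The sketch below assumes f(z) = z² and q = N(μ, σ²), so the true gradient of E[f(z)] with respect to μ is 2μ; it compares per-sample gradient estimates from the two estimators. The test function and the values of μ and σ are illustrative assumptions, and only the gradient with respect to μ is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.5, 0.8, 100_000

eps = rng.standard_normal(n)          # parameter-free noise
z = mu + sigma * eps                  # reparameterised samples from N(mu, sigma^2)

# Reparameterisation: d/dmu f(mu + sigma * eps) = f'(z) * dz/dmu = 2z * 1
reparam = 2.0 * z

# Score function (REINFORCE): f(z) * d/dmu log q(z) = z^2 * (z - mu) / sigma^2
score = z**2 * (z - mu) / sigma**2

print("true gradient  :", 2 * mu)
print("reparam        : mean", reparam.mean(), " variance", reparam.var())
print("score function : mean", score.mean(), " variance", score.var())
```

Both estimators are unbiased here, but the per-sample variance of the score-function estimate is many times larger, which is why it needs far more samples or a variance-reduction baseline to be usable.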

05 · The cost: Why the samples are blurry

The standard complaint about VAEs is soft, averaged-looking output, and it is not a mystery — it follows from two choices made above.

Likelihood with a factorised decoder. p_θ(x|z) is usually an independent Gaussian per pixel with fixed variance, so the reconstruction term is per-pixel squared error up to a scale and an additive constant. When one z is consistent with several sharp images, the loss-minimising decoder outputs their mean, which is blurry. The model is rewarded for hedging.
An imperfect q leaks noise. Whatever the encoder cannot disambiguate, the decoder must average over. Posterior approximation error shows up directly as smoothing in pixel space.
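The averaging effect can be shown with two tiny "images". The sketch below assumes a single latent that is equally consistent with two sharp one-dimensional edges and compares the expected squared error of committing to either edge against outputting their mean. The arrays and the helper name are illustrative.

```python
import numpy as np

# Two equally likely sharp targets for the same latent (toy assumption):
# a step edge starting at pixel 3 and one starting at pixel 5.
edge_a = np.array([0, 0, 0, 1, 1, 1, 1, 1], dtype=float)
edge_b = np.array([0, 0, 0, 0, 0, 1, 1, 1], dtype=float)
targets = np.stack([edge_a, edge_b])

def expected_sq_error(output, targets):
    """Average over targets of the summed per-pixel squared error."""
    return np.mean(np.sum((targets - output) ** 2, axis=1))

mean_output = targets.mean(axis=0)    # value 0.5 at pixels 3 and 4: a soft ramp, not an edge

print("commit to edge A:", expected_sq_error(edge_a, targets))
print("commit to edge B:", expected_sq_error(edge_b, targets))
print("output the mean :", expected_sq_error(mean_output, targets))
```

The mean has lower expected squared error than either sharp answer, so a decoder trained on this loss prefers the blurred ramp.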

Note the contrast with GANs, which refuse pixel-wise likelihood exactly to escape this averaging, and with diffusion, which keeps the likelihood view but breaks generation into many small denoising steps so no single step has to commit to an average.

β-VAE in one line: scale the KL term by β > 1 to buy a more organised, disentangled latent space at the price of even softer reconstructions — the same contract, with the "stay near the prior" clause renegotiated upward.
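Written out in the notation used for the ELBO, the β-weighted objective would look as follows; the symbol for the objective is a labelling choice made here, not something fixed by the text.

```latex
\mathcal{L}_\beta(\theta,\phi;x)
  = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]
  - \beta\,\mathrm{KL}\big(q_\phi(z|x)\,\|\,p(z)\big),
\qquad \beta > 1 .
```

Setting β = 1 recovers the ELBO. Because the KL term is non-negative, any β ≥ 1 gives a value no larger than the ELBO, so the objective is still a lower bound on log p_θ(x), only a looser one.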

Where VAEs live now: as the compression stage of latent diffusion. The VAE supplies a smooth, decodable latent space; a diffusion model handles the actual generation inside it.

Mental Model

The marginal likelihood is an impossible search over latents; the encoder is a learned search heuristic, and the ELBO is the price of trusting it.
ELBO = reconstruction − KL-to-prior: explain the data using latents the prior could plausibly produce.
log p(x) = ELBO + KL(q ‖ posterior): the bound's slack is exactly the encoder's error, so the bound tightens as the encoder improves during training.
Reparameterisation moves randomness into ε so gradients can flow through the sampling step.
Blur is not a bug in the code; it is per-pixel likelihood doing what averaging does.