An intractable integral, and the bound that makes it trainable
A latent-variable model is the natural way to say "data has hidden causes": draw z from a simple prior p(z), then decode it into an observation through pθ(x|z). The model's density on data is the marginal:
pθ(x) = ∫ pθ(x|z) p(z) dz
We would like to maximise log pθ(x). But the integral runs over every possible z, and for a neural decoder it has no closed form. Naive Monte Carlo — sample z from the prior, average pθ(x|z) — fails in a precise way: for any particular x, almost every prior sample is a z that decodes to something else entirely, contributing essentially zero. The estimate is dominated by a vanishingly rare set of "good" z values you almost never hit.
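The failure of prior sampling can be seen in a toy model where the exact answer is known. The sketch below is illustrative only: it assumes a hypothetical "identity" decoder pθ(x|z) = N(x; z, s²I) with a standard normal prior, so that the true marginal is N(0, (1 + s²)I). The function names and all settings (dimension, noise level, sample count) are choices made for this example, not anything fixed by the text.

```python
import numpy as np

def log_normal_iso(x, mean, var):
    """Log density of an isotropic Gaussian N(mean, var * I), summed over the last axis."""
    d = x.shape[-1]
    sq = np.sum((x - mean) ** 2, axis=-1)
    return -0.5 * d * np.log(2.0 * np.pi * var) - 0.5 * sq / var

def naive_mc_log_marginal(x, decoder_var, n_samples, rng):
    """Estimate log p(x) by averaging p(x|z) over z drawn from the prior N(0, I)."""
    z = rng.standard_normal((n_samples, x.shape[-1]))
    # Toy decoder (assumption): the mean of p(x|z) is z itself.
    log_w = log_normal_iso(x, z, decoder_var)
    # Subtract the max before exponentiating for numerical stability.
    m = np.max(log_w)
    w = np.exp(log_w - m)
    log_est = m + np.log(np.mean(w))
    # Effective sample size: how many samples actually carry the estimate.
    ess = np.sum(w) ** 2 / np.sum(w ** 2)
    return log_est, ess

rng = np.random.default_rng(0)
d, decoder_var, n = 10, 0.1 ** 2, 10_000
x = np.full(d, 0.5)

log_est, ess = naive_mc_log_marginal(x, decoder_var, n, rng)
log_exact = log_normal_iso(x, np.zeros(d), 1.0 + decoder_var)

print(f"naive estimate of log p(x): {log_est:.2f}")
print(f"exact log p(x):             {log_exact:.2f}")
print(f"effective sample size:      {ess:.1f} of {n}")
```

The effective sample size is the number to watch: when it sits near 1, a single lucky draw is carrying the whole estimate, which is the "vanishingly rare good z" problem in numerical form.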
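Introduce an encoder qφ(z|x): a network that proposes plausible latents for a given x. Multiply and divide by it (importance weighting), then push the log inside the expectation with Jensen's inequality:
log pθ(x) = log E_{z~qφ(z|x)}[ pθ(x|z) p(z) / qφ(z|x) ]
          ≥ E_{z~qφ(z|x)}[ log pθ(x|z) ] − KL( qφ(z|x) ‖ p(z) )
The right-hand side is the evidence lower bound (ELBO): a reconstruction term minus a KL term.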
Read it as a contract. The reconstruction term says: latents proposed for x must decode back to x. The KL term says: those latents must not stray far from the prior, because the prior is all we will have at generation time. Both expectations are over q, which we can sample cheaply — the intractable search has become a regression-like training objective.
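As a rough sketch of how the two terms are computed in practice: the reconstruction term is averaged over samples from q, while the KL between a diagonal Gaussian and a standard normal prior has a closed form. The toy decoder pθ(x|z) = N(x; z, s²I), the helper names, and the particular numbers are all assumptions made for illustration; a real VAE would replace the decoder with a network.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), closed form, summed over dimensions."""
    return 0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - 2.0 * np.log(sigma))

def log_likelihood(x, z, decoder_var):
    """Toy decoder (assumption): p(x|z) = N(x; z, decoder_var * I)."""
    d = x.shape[-1]
    sq = np.sum((x - z) ** 2, axis=-1)
    return -0.5 * d * np.log(2.0 * np.pi * decoder_var) - 0.5 * sq / decoder_var

def elbo_estimate(x, mu, sigma, decoder_var, n_samples, rng):
    """Monte Carlo reconstruction term minus closed-form KL term."""
    z = mu + sigma * rng.standard_normal((n_samples, mu.shape[-1]))
    reconstruction = np.mean(log_likelihood(x, z, decoder_var))
    kl = gaussian_kl_to_standard_normal(mu, sigma)
    return reconstruction - kl, reconstruction, kl

rng = np.random.default_rng(0)
x = np.array([1.0, -0.5])
mu = np.array([0.3, 0.3])        # encoder mean for this x (made-up values)
sigma = np.array([0.5, 0.5])     # encoder standard deviation (made-up values)
decoder_var = 0.25

elbo, reconstruction, kl = elbo_estimate(x, mu, sigma, decoder_var, 20_000, rng)

# In this toy model the exact marginal is N(0, (1 + decoder_var) * I).
d = x.shape[-1]
log_px = (-0.5 * d * np.log(2.0 * np.pi * (1.0 + decoder_var))
          - 0.5 * np.sum(x ** 2) / (1.0 + decoder_var))

print(f"reconstruction term: {reconstruction:.3f}")
print(f"KL term:             {kl:.3f}")
print(f"ELBO:                {elbo:.3f}")
print(f"exact log p(x):      {log_px:.3f}")
```

Because the encoder here is deliberately not the true posterior, the ELBO lands below the exact log-likelihood, as the bound requires.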
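How loose is the bound? One line of algebra (write the joint as posterior times marginal, pθ(x, z) = pθ(z|x) pθ(x)) gives the exact gap:
log pθ(x) = ELBO(θ, φ; x) + KL( qφ(z|x) ‖ pθ(z|x) ),  where ELBO(θ, φ; x) = E_{z~qφ(z|x)}[ log pθ(x|z) ] − KL( qφ(z|x) ‖ p(z) )
The slack is exactly the encoder's distance from the true posterior pθ(z|x), measured in KL. Maximising the ELBO therefore does two jobs at once. Over φ it tightens the bound (better inference): log pθ(x) does not depend on φ, so any gain in the ELBO is a reduction in the gap. Over θ it pushes up a lower bound on the likelihood itself. The bound is tight precisely when qφ(z|x) equals the true posterior.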
For fixed θ, log p(x) is a ceiling. The ELBO sits below it by exactly KL(q ‖ true posterior); improving the encoder closes the gap from beneath.
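The gap identity can be checked numerically in a one-dimensional linear-Gaussian model, where the marginal, the posterior, and the ELBO all have closed forms. This is a sketch under assumptions: prior N(0, 1), toy decoder pθ(x|z) = N(x; z, s²), and a Gaussian encoder q = N(m, v) with made-up values; none of these specifics come from the text.

```python
import numpy as np

def log_marginal(x, s2):
    """Exact log p(x): marginally x ~ N(0, 1 + s2) in this toy model."""
    return -0.5 * np.log(2.0 * np.pi * (1.0 + s2)) - 0.5 * x ** 2 / (1.0 + s2)

def elbo(x, m, v, s2):
    """Closed-form ELBO for q = N(m, v), prior N(0, 1), decoder N(z, s2)."""
    recon = -0.5 * np.log(2.0 * np.pi * s2) - 0.5 * ((x - m) ** 2 + v) / s2
    kl_prior = 0.5 * (m ** 2 + v - 1.0 - np.log(v))
    return recon - kl_prior

def kl_to_posterior(x, m, v, s2):
    """KL( q || true posterior ), where the posterior is N(x / (1 + s2), s2 / (1 + s2))."""
    post_mean = x / (1.0 + s2)
    post_var = s2 / (1.0 + s2)
    return 0.5 * (np.log(post_var / v) + (v + (m - post_mean) ** 2) / post_var - 1.0)

x, s2 = 1.0, 0.25
m, v = 0.3, 0.25                      # a deliberately imperfect encoder

gap = log_marginal(x, s2) - elbo(x, m, v, s2)
kl_post = kl_to_posterior(x, m, v, s2)

# Setting q to the exact posterior should close the gap entirely.
tight_gap = log_marginal(x, s2) - elbo(x, x / (1.0 + s2), s2 / (1.0 + s2), s2)

print(f"log p(x) - ELBO:      {gap:.6f}")
print(f"KL(q || posterior):   {kl_post:.6f}")
print(f"gap at q = posterior: {tight_gap:.2e}")
```

The first two printed numbers agree up to floating-point error, and the third is zero up to the same error: the slack is the KL to the posterior and nothing else.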
One obstacle remains. The reconstruction term is an expectation over qφ, and φ sits inside the sampling distribution. You cannot backpropagate through the act of sampling: ∇φ E_{z~qφ}[f(z)] is not E_{z~qφ}[∇φ f(z)], because φ also shapes the distribution being averaged over.
The fix is to relocate the randomness. For a Gaussian encoder qφ(z|x) = N(μφ(x), σφ(x)²), write the sample as a deterministic function of parameter-free noise:
z = μφ(x) + σφ(x) ⊙ ε,  ε ~ N(0, I)
Now ε carries all the randomness and φ only shapes a deterministic path, so ordinary backpropagation flows from the decoder's loss, through z, into the encoder. (This is the single implementation detail that made VAEs practical; the score-function/REINFORCE estimator works without it but is far noisier.)
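The difference in noise between the two estimators can be made concrete with a one-parameter example. In the sketch below, f(z) = z² stands in for the decoder's loss (an assumption chosen because the true gradient, ∇μ E[z²] = 2μ, is known), and both estimators are fed the same noise draws. The function names are hypothetical.

```python
import numpy as np

def grad_mu_reparam(mu, sigma, eps):
    """Pathwise samples of d/dmu E[f(z)] for f(z) = z**2, with z = mu + sigma * eps."""
    z = mu + sigma * eps
    return 2.0 * z                      # df/dz times dz/dmu, and dz/dmu = 1

def grad_mu_score(mu, sigma, eps):
    """Score-function (REINFORCE) samples: f(z) * d/dmu log N(z; mu, sigma^2)."""
    z = mu + sigma * eps
    return z ** 2 * (z - mu) / sigma ** 2

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0
eps = rng.standard_normal(200_000)      # parameter-free noise shared by both estimators

g_rep = grad_mu_reparam(mu, sigma, eps)
g_sf = grad_mu_score(mu, sigma, eps)

print(f"true gradient:        {2.0 * mu:.3f}")
print(f"reparameterised mean: {g_rep.mean():.3f}, variance: {g_rep.var():.2f}")
print(f"score-function mean:  {g_sf.mean():.3f}, variance: {g_sf.var():.2f}")
```

Both estimators are unbiased, so both means settle near 2μ. For these particular settings the per-sample variances work out analytically to 4 for the reparameterised estimator and 30 for the score-function one, and the gap generally widens as f becomes more complicated.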
The standard complaint about VAEs is soft, averaged-looking output, and it is not a mystery — it follows from two choices made above.
Note the contrast with GANs, which refuse pixel-wise likelihood exactly to escape this averaging, and with diffusion, which keeps the likelihood view but breaks generation into many small denoising steps so no single step has to commit to an average.
β-VAE in one line: scale the KL term by β > 1 to buy a more organised, disentangled latent space at the price of even softer reconstructions — the same contract, with the "stay near the prior" clause renegotiated upward.
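In code, the change really is one line. The sketch below assumes the reconstruction log-likelihood and the KL term have already been computed for a batch; the function name and the numbers are made up for illustration.

```python
def beta_vae_loss(reconstruction_log_lik, kl, beta=1.0):
    """Loss to minimise: the negative ELBO with the KL term scaled by beta.

    beta = 1.0 recovers the plain negative ELBO; beta > 1 weights the
    "stay near the prior" term more heavily.
    """
    return -(reconstruction_log_lik - beta * kl)

# Hypothetical per-example values, for illustration only.
reconstruction_log_lik, kl = -3.7, 0.73

plain = beta_vae_loss(reconstruction_log_lik, kl, beta=1.0)
heavier = beta_vae_loss(reconstruction_log_lik, kl, beta=4.0)
print(f"beta = 1: {plain:.2f}")
print(f"beta = 4: {heavier:.2f}")
```

With β > 1, the same KL value costs more, so the optimiser will give up some reconstruction quality to pull the encoder's output closer to the prior.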