Generative Modelling

Diffusion: Reverse Process

DDPM and DDIM — unstirring the coffee, one step at a time

01 · First principles · Why small steps are reversible

The forward process destroyed data in tiny Gaussian steps. We now want q(xt−1|xt) — the distribution over "what the previous frame was". In general, reversing a stochastic process is as hard as knowing the data distribution itself. The whole construction was arranged so that one special case holds:

The key fact: when the forward steps are small and Gaussian, the true reverse conditionals are, to first order, also Gaussian — with a mean that is a small correction of xt in the direction of the score ∇ log q(xt). A Gaussian with a computable mean is something a network can fit.

So the entire problem of generation reduces to: estimate the score of the noised data at every noise level. Everything else is plumbing.
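The "small correction in the direction of the score" can be checked numerically in a case where everything is known in closed form. The sketch below is a toy under stated assumptions, not part of the original construction: it assumes a scalar forward step x_t = √(1−β) x_{t−1} + √β ε and a zero-mean Gaussian marginal for x_{t−1}, and the function names are illustrative. In this Gaussian case the reverse mean obtained by ordinary Gaussian conditioning and the score-corrected form agree.

```python
import math

def reverse_mean_exact(x_t, s2_prev, beta):
    """E[x_{t-1} | x_t] by Gaussian conditioning, assuming
    x_{t-1} ~ N(0, s2_prev) and x_t = sqrt(1-beta) x_{t-1} + sqrt(beta) eps."""
    var_t = (1.0 - beta) * s2_prev + beta      # variance of q(x_t)
    return math.sqrt(1.0 - beta) * s2_prev / var_t * x_t

def reverse_mean_from_score(x_t, s2_prev, beta):
    """The same mean written as x_t nudged along the score of q(x_t)."""
    var_t = (1.0 - beta) * s2_prev + beta
    score = -x_t / var_t                       # d/dx log N(x; 0, var_t)
    return (x_t + beta * score) / math.sqrt(1.0 - beta)

# Illustrative values: one small forward step on data with variance 0.25.
x_t, s2_prev, beta = 1.3, 0.25, 0.01
exact = reverse_mean_exact(x_t, s2_prev, beta)
via_score = reverse_mean_from_score(x_t, s2_prev, beta)
print(exact, via_score)
```

For non-Gaussian data the score is no longer available in closed form, which is the gap the network is trained to fill.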

02 · The parameterisation · Predict the noise

Condition on x0 and the reverse step has an exact closed form (Bayes on three Gaussians): q(xt−1|xt, x0) = N(μ̃(xt, x0), β̃tI). At sampling time we lack x0 — but the forward closed form xt = √ᾱt x0 + √(1−ᾱt) ε says that knowing the noise ε is the same as knowing x0. So train a network εθ(xt, t) to predict the noise that was added. The DDPM derivation starts from a VAE-style ELBO over the chain, but after simplification (dropping per-step weights, which empirically helps) the loss is plain regression:

Lsimple = E_{x0, t, ε} [ ‖ ε − εθ(√ᾱt x0 + √(1−ᾱt) ε,  t) ‖² ]
where ε is the noise actually added and εθ(·, t) is the network's guess.
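The loss can be sketched as a Monte Carlo estimate on scalar toy data. Everything here is an assumption made for illustration: a linear β schedule from 1e-4 to 0.02 over T = 1000, data x0 ~ N(0, S²), and two stand-in "networks" (one that always predicts zero, and the closed-form optimum E[ε | xt] that exists only because the toy data is Gaussian). A real εθ would be a trained neural network.

```python
import math
import random

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar[t] for an assumed linear beta schedule; alpha_bar[0] = 1."""
    alpha_bar = [1.0]
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        alpha_bar.append(alpha_bar[-1] * (1.0 - beta))
    return alpha_bar

S = 0.5                      # std of the toy data (assumption)
ALPHA_BAR = make_alpha_bar()

def sample_x0(rng):
    return rng.gauss(0.0, S)

def eps_zero(x_t, t):
    """Baseline predictor: always guess zero noise."""
    return 0.0

def eps_optimal(x_t, t):
    """E[eps | x_t] in closed form, available only for this Gaussian toy."""
    ab = ALPHA_BAR[t]
    return math.sqrt(1.0 - ab) * x_t / (ab * S * S + 1.0 - ab)

def l_simple(eps_model, n=20000, seed=0):
    """Monte Carlo estimate of E[(eps - eps_model(x_t, t))^2]."""
    rng = random.Random(seed)
    T = len(ALPHA_BAR) - 1
    total = 0.0
    for _ in range(n):
        x0 = sample_x0(rng)               # sample a data point
        t = rng.randint(1, T)             # sample a timestep uniformly
        eps = rng.gauss(0.0, 1.0)         # sample the noise
        ab = ALPHA_BAR[t]
        x_t = math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps
        total += (eps - eps_model(x_t, t)) ** 2
    return total / n

loss_zero = l_simple(eps_zero)
loss_opt = l_simple(eps_optimal)
print(loss_zero, loss_opt)
```

The zero predictor's loss sits near 1 (the variance of ε), and the conditional-mean predictor does better, which is the sense in which minimising this loss recovers E[ε | xt].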

No adversary, no encoder, no partition function — sample an image, a timestep, a noise vector; one forward pass; squared error. This boring loss is most of why diffusion training is so stable. And it is denoising score matching in disguise: the predicted noise gives the score directly,

∇xt log q(xt) ≈ −εθ(xt, t) / √(1−ᾱt)
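One way to see where this relation comes from, sketched from the forward closed form xt = √ᾱt x0 + √(1−ᾱt) ε. The only ingredients assumed beyond the text are two standard facts: the marginal score is the posterior average of the conditional score, and the minimiser of a squared-error loss is the conditional mean E[ε | xt].

```latex
\begin{aligned}
q(x_t \mid x_0) &= \mathcal{N}\!\big(\sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\big) \\
\nabla_{x_t} \log q(x_t \mid x_0)
  &= -\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{1-\bar\alpha_t}
   = -\frac{\varepsilon}{\sqrt{1-\bar\alpha_t}} \\
\nabla_{x_t} \log q(x_t)
  &= \mathbb{E}\big[\nabla_{x_t} \log q(x_t \mid x_0) \,\big|\, x_t\big]
   = -\frac{\mathbb{E}[\varepsilon \mid x_t]}{\sqrt{1-\bar\alpha_t}}
   \approx -\frac{\varepsilon_\theta(x_t, t)}{\sqrt{1-\bar\alpha_t}}
\end{aligned}
```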

03 · Sampler one · DDPM: stochastic ancestral sampling

Start at xT ~ N(0, I) and walk the chain backwards. At each step, use εθ to form the mean of the reverse Gaussian, then add fresh noise scaled by the step's variance:

xt−1 = (1/√αt) ( xt − (βt/√(1−ᾱt)) · εθ(xt, t) ) + σt z,    z ~ N(0, I)
where z is fresh noise drawn at each step and σt² is the step's variance (DDPM uses βt or β̃t).

This is Langevin-flavoured sampling on a ladder of noise levels: denoise a little, re-noise a little less. The re-injection of noise is what keeps the sampler honest about the remaining uncertainty at each level. The cost is the step count — the Gaussian approximation to the reverse step is only valid when steps are small, so DDPM needs the full T (originally 1000) network evaluations. One image, a thousand forward passes.
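A minimal sketch of the ancestral loop on scalar toy data, under assumptions that are not in the original: a linear β schedule (1e-4 to 0.02, T = 1000), data x0 ~ N(M, S²), σt² = βt, and a closed-form stand-in for εθ that is exact only because the toy data is Gaussian. The noise term is dropped on the final step, as in the DDPM sampling algorithm.

```python
import math
import random

T = 1000
BETAS = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # beta_1..beta_T
ALPHA_BAR = [1.0]                                               # alpha_bar[0] = 1
for b in BETAS:
    ALPHA_BAR.append(ALPHA_BAR[-1] * (1.0 - b))

M, S = 2.0, 0.5   # toy data distribution N(M, S^2) (assumption)

def eps_model(x_t, t):
    """Stand-in for eps_theta: exact E[eps | x_t] for the Gaussian toy data."""
    ab = ALPHA_BAR[t]
    var_t = ab * S * S + (1.0 - ab)
    return math.sqrt(1.0 - ab) * (x_t - math.sqrt(ab) * M) / var_t

def ddpm_sample(rng):
    x = rng.gauss(0.0, 1.0)                       # x_T ~ N(0, 1)
    for t in range(T, 0, -1):                     # walk the chain backwards
        beta = BETAS[t - 1]
        alpha = 1.0 - beta
        eps = eps_model(x, t)
        mean = (x - beta / math.sqrt(1.0 - ALPHA_BAR[t]) * eps) / math.sqrt(alpha)
        z = rng.gauss(0.0, 1.0) if t > 1 else 0.0  # no fresh noise on the last step
        x = mean + math.sqrt(beta) * z             # sigma_t^2 = beta_t
    return x

rng = random.Random(0)
samples = [ddpm_sample(rng) for _ in range(500)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((v - mean) ** 2 for v in samples) / len(samples))
print(mean, std)   # should land near M and S
```

Each sample costs T calls to eps_model, which is the thousand-forward-pass price described above.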

04 · Sampler two · DDIM: same network, deterministic path

DDIM's observation: the training loss above only ever uses the marginals q(xt|x0). Many different joint processes share those marginals — including non-Markovian ones in which xt−1 depends on both xt and x0 with zero injected noise. Pick that one, plug in the network's estimate x̂0 = (xt − √(1−ᾱt) εθ)/√ᾱt, and the update becomes deterministic:

xt−1 = √ᾱt−1 · x̂0 + √(1−ᾱt−1) · εθ(xt, t)
That is: re-noise the current best guess of x₀ with the predicted, not fresh, noise.

Read the move: estimate the clean image, then place it at noise level t−1 using the predicted noise direction rather than a fresh draw. Because no randomness enters, the trajectory is a smooth path that tolerates big jumps — 20 to 50 steps instead of 1000 — on the very same trained network. No retraining, only a different walk.
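The deterministic update with strided timesteps can be sketched on the same kind of scalar toy. Assumptions made for illustration: a linear β schedule (1e-4 to 0.02, T = 1000), data x0 ~ N(M, S²), an evenly strided timestep subsequence, and a closed-form stand-in for εθ that is exact only for Gaussian data. When jumping from t to an earlier t_prev, the update uses ᾱ at t_prev in place of ᾱt−1.

```python
import math

T = 1000
ALPHA_BAR = [1.0]                                   # alpha_bar[0] = 1 (clean data)
for i in range(T):
    beta = 1e-4 + (0.02 - 1e-4) * i / (T - 1)
    ALPHA_BAR.append(ALPHA_BAR[-1] * (1.0 - beta))

M, S = 2.0, 0.5   # toy data distribution N(M, S^2) (assumption)

def eps_model(x_t, t):
    """Stand-in for eps_theta: exact E[eps | x_t] for the Gaussian toy data."""
    ab = ALPHA_BAR[t]
    var_t = ab * S * S + (1.0 - ab)
    return math.sqrt(1.0 - ab) * (x_t - math.sqrt(ab) * M) / var_t

def ddim_sample(x_T, num_steps=50):
    """Deterministic DDIM walk (sigma = 0) over a strided timestep schedule."""
    stride = T // num_steps
    x = x_T
    for t in range(T, 0, -stride):
        t_prev = max(t - stride, 0)
        ab, ab_prev = ALPHA_BAR[t], ALPHA_BAR[t_prev]
        eps = eps_model(x, t)
        x0_hat = (x - math.sqrt(1.0 - ab) * eps) / math.sqrt(ab)      # clean estimate
        x = math.sqrt(ab_prev) * x0_hat + math.sqrt(1.0 - ab_prev) * eps
    return x

a = ddim_sample(0.3)
b = ddim_sample(0.3)          # same latent, same result: no randomness enters
centre = ddim_sample(0.0)
spread = (ddim_sample(1.0) - ddim_sample(-1.0)) / 2.0   # output change per unit of x_T
print(a, centre, spread)
```

In this toy the map from xT to the output is affine, so `spread` is the standard deviation of the generated samples when xT ~ N(0, 1); with only 50 steps it comes out somewhat below S, a small discretisation cost of the large jumps.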

Determinism buys two further properties. The map from xT to the image is now a function, so xT acts as a true latent code (interpolate between two codes, get a semantic blend). And the map is invertible, up to discretisation error: run the updates in reverse to encode a real image into noise, which is the basis of most diffusion-based image editing. DDIM is, in fact, a discretisation of the probability-flow ODE of the SDE view.
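Running the updates in reverse can be sketched as follows, again on a scalar toy with assumptions that are not in the original: a linear β schedule, data x0 ~ N(M, S²), and a closed-form stand-in for εθ. The encoder applies the same deterministic step from a lower to a higher noise level, using the noise predicted at the current point. This is a common approximation, so the round trip is close rather than exact, and it tightens as the number of steps grows.

```python
import math

T = 1000
ALPHA_BAR = [1.0]                                   # alpha_bar[0] = 1 (clean data)
for i in range(T):
    beta = 1e-4 + (0.02 - 1e-4) * i / (T - 1)
    ALPHA_BAR.append(ALPHA_BAR[-1] * (1.0 - beta))

M, S = 2.0, 0.5   # toy data distribution N(M, S^2) (assumption)

def eps_model(x_t, t):
    """Stand-in for eps_theta: exact E[eps | x_t] for the Gaussian toy data."""
    ab = ALPHA_BAR[t]
    var_t = ab * S * S + (1.0 - ab)
    return math.sqrt(1.0 - ab) * (x_t - math.sqrt(ab) * M) / var_t

def ddim_step(x, t_from, t_to):
    """One deterministic DDIM move between two noise levels (either direction)."""
    ab_from, ab_to = ALPHA_BAR[t_from], ALPHA_BAR[t_to]
    eps = eps_model(x, t_from)
    x0_hat = (x - math.sqrt(1.0 - ab_from) * eps) / math.sqrt(ab_from)
    return math.sqrt(ab_to) * x0_hat + math.sqrt(1.0 - ab_to) * eps

def schedule(num_steps):
    """Evenly strided timesteps from 0 up to T, always ending at T."""
    stride = T // num_steps
    return list(range(0, T, stride)) + [T]

def encode(x0, num_steps):
    ts = schedule(num_steps)
    x = x0
    for t_from, t_to in zip(ts[:-1], ts[1:]):       # data -> noise
        x = ddim_step(x, t_from, t_to)
    return x

def decode(x_T, num_steps):
    ts = schedule(num_steps)[::-1]
    x = x_T
    for t_from, t_to in zip(ts[:-1], ts[1:]):       # noise -> data
        x = ddim_step(x, t_from, t_to)
    return x

x0 = 2.6
err_50 = abs(decode(encode(x0, 50), 50) - x0)
err_1000 = abs(decode(encode(x0, 1000), 1000) - x0)
print(err_50, err_1000)
```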

| | DDPM | DDIM |
|---|---|---|
| Reverse step | Gaussian, fresh noise each step | Deterministic (σ = 0) |
| Chain | Markovian, ancestral | Non-Markovian, same marginals |
| Steps needed | ~1000 | ~20–50 |
| xT → image | One-to-many (stochastic) | Function; invertible; latent space |
| Sample diversity at fixed xT | Yes, from injected noise | None; all variety comes from xT |
| Retraining required | None; same εθ | None; same εθ |

05 · Visualize it · Two walks home

[Figure: two trajectories from x_T at t = T (noise) to x_0 at t = 0 (data). DDPM: 1000 noisy steps. DDIM: few deterministic steps.]

Same trained network, two samplers. DDPM staggers home, re-noising at every step; DDIM glides along a smooth deterministic trajectory it can traverse in large jumps.

Mental Model