Generative Modelling

Diffusion: Forward Process

Destroying data carefully, so that creation can be learned

01 · First principles: Why destroy the data at all?

Generation in one shot — map pure noise to a finished image in a single step — asks one network to invent everything at once: layout, objects, texture, lighting. GANs attempt exactly this and pay for it in training instability; VAEs pay in blur. The transformation from noise to data is simply too violent to learn as one move.

Diffusion's idea is to never ask for the violent move. Instead, define a forward process that destroys data in many tiny steps, each adding a small amount of Gaussian noise. Each tiny destruction is trivially simple — and, crucially, a tiny destruction is only slightly ambiguous to undo. Then train a network to undo one tiny step at a time (the reverse process).

The analogy is stirring milk into coffee. One hundred small stirs take the swirl to uniform beige gradually; between any two consecutive frames, the change is small enough that you could plausibly say what the previous frame looked like. Asked to reconstruct the original swirl from the final beige in one step, you could not. Asked to reverse one small stir, you nearly can — and "nearly, with small Gaussian uncertainty" is exactly what a network can learn.

The design principle: we choose the destruction precisely so that its reversal decomposes into many easy, locally-Gaussian problems.

02 · The process: Small Gaussian steps, and the closed form

Fix a variance schedule β_1, …, β_T (small numbers, e.g. from 10⁻⁴ to 0.02). Each step shrinks the signal slightly and tops it up with fresh noise:

q(x_t | x_{t−1}) = N( √(1−β_t) · x_{t−1},  β_t I )
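
As a minimal sketch of this single step, the transition can be sampled one coordinate at a time using only the Python standard library. The function name `forward_step`, the toy data point, and the seed are illustrative assumptions, not part of the original text.

```python
import math
import random

def forward_step(x_prev, beta_t, rng):
    """Draw x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I), one coordinate at a time."""
    shrink = math.sqrt(1.0 - beta_t)  # scales the existing signal down slightly
    sigma = math.sqrt(beta_t)         # standard deviation of the fresh noise
    return [shrink * x + sigma * rng.gauss(0.0, 1.0) for x in x_prev]

rng = random.Random(0)            # seeded so the sketch is repeatable
x0 = [1.0, -2.0, 0.5]             # a toy three-value "image" (hypothetical data)
x1 = forward_step(x0, 1e-4, rng)  # one step at the smallest beta barely moves x0
print(x1)
```

With β_t = 10⁻⁴ the noise has standard deviation 0.01, so `x1` stays very close to `x0`; the destruction only becomes visible after many such steps.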

The shrink factor √(1−β_t) matters: it keeps the total variance bounded, so the process converges to N(0, I) instead of exploding. Now the key engineering fact. Because a Gaussian convolved with a Gaussian is Gaussian, the whole chain composes in closed form. Writing α_t = 1−β_t and ᾱ_t = ∏_{s≤t} α_s:

q(x_t | x_0) = N( √ᾱ_t · x_0,  (1−ᾱ_t) I )   ⇔   x_t = √ᾱ_t x_0 + √(1−ᾱ_t) ε,   ε ~ N(0, I)
(√ᾱ_t x_0 is the surviving signal; √(1−ᾱ_t) ε is the accumulated noise)
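
For readers who want to see why the chain collapses, here is a sketch of the two-step case, written with the same α_t and ᾱ_t as above. The noise symbols ε_t, ε_{t−1} and the merged noise ε̄ are notation introduced here for illustration; all are standard normal and independent.

```latex
\begin{align*}
x_t &= \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \varepsilon_t \\
    &= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\, x_{t-2}
       + \sqrt{1-\alpha_{t-1}}\, \varepsilon_{t-1}\right)
       + \sqrt{1-\alpha_t}\, \varepsilon_t \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2}
       + \underbrace{\sqrt{\alpha_t (1-\alpha_{t-1})}\, \varepsilon_{t-1}
       + \sqrt{1-\alpha_t}\, \varepsilon_t}_{\text{sum of two independent Gaussians}} \\
% Variances of independent Gaussians add:
%   \alpha_t (1-\alpha_{t-1}) + (1-\alpha_t) = 1 - \alpha_t \alpha_{t-1}
    &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2}
       + \sqrt{1-\alpha_t \alpha_{t-1}}\, \bar{\varepsilon},
       \qquad \bar{\varepsilon} \sim \mathcal{N}(0, I).
\end{align*}
```

Repeating the same merge down to x_0 replaces the product α_t α_{t−1} ⋯ α_1 with ᾱ_t, which gives the closed form stated above.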

This single line is why diffusion trains cheaply. To get a training example at noise level t, we do not simulate t steps of stirring — we jump straight there with one sample of ε. Every minibatch can hit a random t at the cost of one multiply-add. Without this closed form, training would require running the chain, and the method would be impractical.
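
The jump to an arbitrary noise level might look like the following sketch. It assumes a linear schedule over the range quoted earlier and T = 1000 steps (the text does not fix T); the helper names `linear_betas`, `cumulative_alpha_bar`, and `q_sample` are hypothetical.

```python
import math
import random

def linear_betas(T, beta_min=1e-4, beta_max=0.02):
    """Linearly spaced beta_1..beta_T (range from the text; T is an assumption)."""
    return [beta_min + (beta_max - beta_min) * i / (T - 1) for i in range(T)]

def cumulative_alpha_bar(betas):
    """alpha_bar_t = prod_{s<=t} (1 - beta_s), returned for t = 1..T."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, alpha_bar, rng):
    """Jump straight to x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps (t is 1-indexed)."""
    a = alpha_bar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps  # eps is returned in case a training loop wants it as a target

T = 1000                                   # assumed chain length
alpha_bar = cumulative_alpha_bar(linear_betas(T))
rng = random.Random(0)
x0 = [1.0, -2.0, 0.5]                      # toy data point (hypothetical)
t = rng.randint(1, T)                      # a random noise level for this example
x_t, eps = q_sample(x0, t, alpha_bar, rng)
print(t, x_t)
```

The schedule products are computed once up front; after that, each training example costs one lookup of ᾱ_t and one draw of ε, however large t is.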

By t = T, ᾱ_T is close to 0, so q(x_T | x_0) is approximately N(0, I) regardless of x_0: every image is stirred into the same beige. That shared endpoint is what the sampler will start from.
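
A quick numerical check of both claims, as a sketch under the same assumptions as before (linear schedule, T = 1000, which the text does not specify): track the coefficient on x_0 and the noise variance step by step, then compare with the closed form.

```python
import math

T = 1000  # assumed chain length
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

mean_coef, var = 1.0, 0.0  # at t = 0: coefficient 1 on x_0, no noise yet
alpha_bar = 1.0
for beta in betas:
    # One step of q(x_t | x_{t-1}): shrink the mean, shrink the old noise, add beta.
    mean_coef *= math.sqrt(1.0 - beta)
    var = (1.0 - beta) * var + beta
    alpha_bar *= 1.0 - beta

# The step-by-step statistics agree with the closed form.
assert math.isclose(mean_coef, math.sqrt(alpha_bar), rel_tol=1e-9)
assert math.isclose(var, 1.0 - alpha_bar, rel_tol=1e-9)
print(f"alpha_bar_T = {alpha_bar:.2e}, coefficient on x_0 = {mean_coef:.4f}, variance = {var:.5f}")
```

Under these assumptions ᾱ_T is small but not exactly zero, so a faint trace of x_0 survives at the endpoint; the claim that x_T is standard normal is an approximation that improves as the schedule gets longer or more aggressive.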

03 · Visualize it: A 1-D distribution dissolving

[Figure: four panels at t = 0, t = T/4, t = T/2, and t = T (≈ N(0,1)), ordered left to right in the direction of forward noising.]

A bimodal data distribution under increasing noise: modes broaden, merge, and converge to the same standard Gaussian (dashed). All structure is gone by t = T — by construction.
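
The behaviour in the figure can be reproduced without sampling, because each Gaussian mode stays Gaussian under the forward process. The sketch below assumes a hypothetical bimodal distribution with modes at ±2 and width 0.3, a linear schedule, and T = 1000; none of these values come from the original figure.

```python
import math

T = 1000  # assumed chain length
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

alpha_bar = [1.0]  # index 0 is the clean data, index t is after t noising steps
for b in betas:
    alpha_bar.append(alpha_bar[-1] * (1.0 - b))

def noised_mode(mu, sigma, a_bar):
    """Push a mode N(mu, sigma^2) through q(x_t | x_0): returns its new centre and width."""
    centre = math.sqrt(a_bar) * mu
    width = math.sqrt(a_bar * sigma ** 2 + (1.0 - a_bar))
    return centre, width

for t in (0, T // 4, T // 2, T):
    centre, width = noised_mode(2.0, 0.3, alpha_bar[t])
    print(f"t = {t:4d}: modes at ±{centre:.3f}, width {width:.3f}")
```

The mode centres slide toward zero while the widths grow toward one, which is the broadening and merging described in the caption.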

04 · The knob: Schedules and signal-to-noise ratio

The schedule {β_t} is best read through one summary number, the signal-to-noise ratio at step t:

SNR(t) = ᾱ_t / (1 − ᾱ_t)

SNR must travel from very large (clean data) to near zero (pure noise); the schedule decides how it spends time along the way. That allocation is a curriculum: each noise level the network sees is a different task. High-SNR steps teach fine texture repair; low-SNR steps teach global layout from almost nothing.
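
To get a feel for the range SNR covers, here is a small sketch that evaluates it along a linear schedule. As before, T = 1000 and the helper name `snr` are assumptions made for illustration.

```python
T = 1000  # assumed chain length
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

alpha_bar, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bar.append(prod)  # alpha_bar[t - 1] holds the value for step t

def snr(alpha_bar_t):
    """Signal-to-noise ratio: surviving signal variance over accumulated noise variance."""
    return alpha_bar_t / (1.0 - alpha_bar_t)

snr_values = [snr(a) for a in alpha_bar]
for t in (1, 250, 500, 750, 1000):
    print(f"t = {t:4d}: SNR = {snr_values[t - 1]:.3e}")
```

Because the values span many orders of magnitude, SNR is usually easier to inspect on a logarithmic scale.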

Schedule | Behaviour | Caveat
Linear β (original DDPM) | Simple; destroys signal aggressively early on. | Wastes many late steps at SNR ≈ 0, where there is nothing left to learn.
Cosine | ᾱ_t follows a cosine; SNR decays smoothly, mid-range levels get more time. | The common default for pixel-space models.
Shifted / resolution-aware | Shift SNR lower for high-resolution images, where redundancy among pixels makes a given noise level effectively easier. | The "right" schedule depends on data dimensionality; there is no universal one.

One subtlety worth keeping: the forward process has no learned parameters. Everything trainable lives in the reverse process; the forward process is pure scaffolding, chosen once, that defines the family of denoising tasks. Its continuous-time limit is taken up in diffusion as SDEs.
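
To make the linear and cosine schedules concrete, the sketch below builds ᾱ_t for both and counts how many steps each spends at very low SNR. It assumes T = 1000 and the commonly used cosine form ᾱ_t = f(t)/f(0) with f(t) = cos²(((t/T + s)/(1 + s)) · π/2) and offset s = 0.008; the threshold of 0.01 and the helper names are arbitrary choices for illustration. Practical implementations typically also clip the resulting β_t near the end of the chain, which is omitted here.

```python
import math

def linear_alpha_bar(T, beta_min=1e-4, beta_max=0.02):
    """alpha_bar_1..alpha_bar_T for linearly spaced betas."""
    out, prod = [], 1.0
    for i in range(T):
        prod *= 1.0 - (beta_min + (beta_max - beta_min) * i / (T - 1))
        out.append(prod)
    return out

def cosine_alpha_bar(T, s=0.008):
    """alpha_bar_t = f(t) / f(0) with f(t) = cos^2(((t / T + s) / (1 + s)) * pi / 2)."""
    def f(t):
        return math.cos((t / T + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

def steps_below(alpha_bar, snr_threshold):
    """Number of steps whose SNR = alpha_bar / (1 - alpha_bar) is under the threshold."""
    return sum(1 for a in alpha_bar if a / (1.0 - a) < snr_threshold)

T = 1000  # assumed chain length
lin, cos_ = linear_alpha_bar(T), cosine_alpha_bar(T)
low_linear = steps_below(lin, 0.01)
low_cosine = steps_below(cos_, 0.01)
print(f"steps with SNR < 0.01: linear = {low_linear}, cosine = {low_cosine}")
print(f"alpha_bar at T/2: linear = {lin[T // 2 - 1]:.3f}, cosine = {cos_[T // 2 - 1]:.3f}")
```

Under these assumptions the linear schedule spends noticeably more of its steps in the near-zero-SNR tail, while the cosine schedule retains more signal at the midpoint.
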
Mental Model