Generative Modelling

Flow Matching

Skip the noise — learn the wind that blows noise into data

01 · First principles

Why simulate randomness at all?

Diffusion already taught us that a deterministic ODE — the probability-flow ODE — can transport noise to data just as well as the stochastic process can. But there we obtained the ODE indirectly: first define a noising SDE, then learn its score, then convert. Flow matching asks the blunt question: if a velocity field is all we need in the end, why not learn the velocity field directly?

The object to learn is v_θ(x, t): a time-dependent vector field, the "wind" at every point of space. Sampling means dropping a noise particle at t = 0 and letting the wind carry it:

dx/dt = v_θ(x, t),    x(0) ~ N(0, I)  ⟶  x(1) ~ p_data

No Brownian motion, no thousand corrector steps, no variance schedules. One ODE, integrated from 0 to 1.
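As a rough illustration of "one ODE, integrated from 0 to 1", here is a minimal fixed-step Euler sampler. The names `euler_sample` and `toy_velocity` are hypothetical, and `toy_velocity` is only a stand-in for a trained network v_θ: it is a constant field, chosen so the result is easy to check by hand.

```python
import numpy as np

def euler_sample(velocity, x0, n_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * velocity(x, t)  # follow the "wind" for one small step
    return x

def toy_velocity(x, t):
    """Hypothetical stand-in for a trained v_theta: a constant push of +3 per unit time."""
    return np.full_like(x, 3.0)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((5, 2))          # noise particles, x(0) ~ N(0, I)
x1 = euler_sample(toy_velocity, x0, n_steps=10)
```

With a real model, `velocity` would wrap the network's forward pass, and a higher-order integrator could replace the Euler update without changing the interface.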

02 · How it breaks

The naive objective is uncomputable

The obvious training target: pick a probability path p_t interpolating noise to data, with true marginal velocity u_t(x), and regress on it:

L_FM = E_{t, x~p_t} ‖ v_θ(x, t) − u_t(x) ‖²
(u_t(x) is unknown: an average over every datum that could have sent mass through x)

This is circular in the same way naive score matching was: u_t(x) is the velocity of the marginal flow, which at any point x mixes the contributions of all data points whose paths pass nearby. Writing it down requires an integral over the whole dataset weighted by intractable posterior probabilities. We cannot evaluate the target even once.
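For reference, the integral in question can be written out as follows. The notation is an assumption introduced here, not taken from the text above: q is the data distribution, p_t(x | x₁) is the density at time t of paths that end at the datum x₁, and u_t(x | x₁) is the velocity of those paths.

```latex
\begin{align*}
p_t(x) &= \int p_t(x \mid x_1)\, q(x_1)\, dx_1 \\
u_t(x) &= \int u_t(x \mid x_1)\, \frac{p_t(x \mid x_1)\, q(x_1)}{p_t(x)}\, dx_1
\end{align*}
```

The fraction is the posterior probability that a particle found at x is headed for x₁. Its denominator p_t(x) is itself an integral over the whole data distribution, which is what makes the target impractical to evaluate.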

03 · The trick

Conditional flow matching

Condition on a single data point x₁ and the impossible becomes trivial. Choose, per sample, the simplest conceivable path from a noise draw x₀ to x₁ — a straight line:

x_t = (1 − t) x₀ + t x₁   ⟹   dx_t/dt = x₁ − x₀
(the conditional velocity: a constant, known in closed form)

Then regress the network on this per-sample velocity:

L_CFM = E_{t, x₁~data, x₀~N(0,I)} ‖ v_θ(x_t, t) − (x₁ − x₀) ‖²
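A minimal NumPy sketch of a one-batch Monte Carlo estimate of this loss follows. It assumes t is drawn uniformly from [0, 1) and that `velocity` is any callable taking `(x_t, t)`; the function names and the toy fields are hypothetical, not part of any library.

```python
import numpy as np

def cfm_loss(velocity, x1, rng):
    """One-batch Monte Carlo estimate of the conditional flow matching loss."""
    n = x1.shape[0]
    x0 = rng.standard_normal(x1.shape)   # noise endpoints, x0 ~ N(0, I)
    t = rng.uniform(size=(n, 1))         # one time per sample, assumed uniform on [0, 1)
    xt = (1.0 - t) * x0 + t * x1         # point on the straight line from x0 to x1
    target = x1 - x0                     # conditional velocity (constant along the line)
    pred = velocity(xt, t)
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))

# Toy check: if every datum equals the same point a, the field (a - x) / (1 - t)
# reproduces each target exactly, so its loss should be numerically zero.
a = np.array([2.0, -1.0])
data = np.tile(a, (256, 1))

def exact_field(x, t):
    return (a - x) / (1.0 - t)

def zero_field(x, t):
    return np.zeros_like(x)

rng = np.random.default_rng(0)
loss_exact = cfm_loss(exact_field, data, rng)
loss_zero = cfm_loss(zero_field, data, rng)
```

In actual training the same quantity would be computed in an autodiff framework and minimised over θ; nothing is simulated, since x_t is sampled directly rather than integrated.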

The theorem that makes this legitimate: ∇_θ L_CFM = ∇_θ L_FM. The two losses differ only by a constant independent of θ, so they have identical gradients and identical minimisers. The argument takes three lines: expand both squares; the ‖v_θ‖² terms match because x_t has the same marginal p_t either way; the cross terms match because the marginal velocity is, by definition, the conditional expectation u_t(x) = E[x₁ − x₀ | x_t = x]; what remains is θ-free. Regressing on a noisy-but-unbiased target trains the same network as regressing on the unobtainable clean one, the same move that rescued denoising score matching.
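Written out, the three lines look roughly like this. It is a sketch that uses the tower property of conditional expectation and assumes all expectations are finite.

```latex
\begin{align*}
\mathcal{L}_{\mathrm{CFM}}
  &= \mathbb{E}_{t, x_0, x_1}\!\left[ \|v_\theta(x_t, t)\|^2
     - 2\,\langle v_\theta(x_t, t),\, x_1 - x_0 \rangle
     + \|x_1 - x_0\|^2 \right] \\
\mathbb{E}_{x_0, x_1}\!\left[ \langle v_\theta(x_t, t),\, x_1 - x_0 \rangle \right]
  &= \mathbb{E}_{x \sim p_t}\!\left[ \langle v_\theta(x, t),\,
     \mathbb{E}[x_1 - x_0 \mid x_t = x] \rangle \right]
   = \mathbb{E}_{x \sim p_t}\!\left[ \langle v_\theta(x, t),\, u_t(x) \rangle \right] \\
\mathcal{L}_{\mathrm{CFM}} - \mathcal{L}_{\mathrm{FM}}
  &= \mathbb{E}_{t, x_0, x_1}\|x_1 - x_0\|^2
   - \mathbb{E}_{t, x \sim p_t}\|u_t(x)\|^2
\end{align*}
```

The last line contains no θ, so differentiating either loss gives the same gradient.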

What the network actually learns: at each point, the average of all straight-line velocities passing through it. Individual targets conflict where lines cross; the L² minimiser is their mean, and that mean is exactly the marginal velocity u_t.
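The claim that the mean of conflicting targets equals the marginal velocity can be checked numerically on a toy problem. The sketch below assumes a 1-D dataset of two equally likely points, −1 and +1, for which the marginal velocity has a closed form; it then compares that closed form against the plain average of the targets x₁ − x₀ over samples whose x_t lands in a narrow bin around a query point. All function names, the bin width, and the sample count are illustrative choices.

```python
import numpy as np

def marginal_velocity_two_points(x, t, points=(-1.0, 1.0)):
    """Closed-form marginal velocity for equally likely data points (toy assumption)."""
    pts = np.asarray(points)
    # Given x1, x_t is Gaussian with mean t * x1 and standard deviation (1 - t).
    logw = -((x - t * pts) ** 2) / (2.0 * (1.0 - t) ** 2)
    w = np.exp(logw - logw.max())
    w /= w.sum()                               # posterior weight of each data point
    # A line through x at time t that ends at x1 has velocity (x1 - x) / (1 - t).
    return float(np.sum(w * (pts - x) / (1.0 - t)))

def binned_mean_target(x, t, n=400_000, half_width=0.02, seed=0):
    """Average of the targets x1 - x0 over samples whose x_t falls near x."""
    rng = np.random.default_rng(seed)
    x0 = rng.standard_normal(n)
    x1 = rng.choice([-1.0, 1.0], size=n)
    xt = (1.0 - t) * x0 + t * x1
    near = np.abs(xt - x) < half_width
    return float(np.mean((x1 - x0)[near]))

exact = marginal_velocity_two_points(0.3, 0.5)
estimate = binned_mean_target(0.3, 0.5)
```

The two numbers should agree up to Monte Carlo noise and a small binning bias. Individual targets in the bin take two very different values, one per data point, and it is only their average that matches the closed form.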

04 · Visualize it

Why straight paths matter at inference

[Figure: two paths from noise x₀ to data x₁. Curved path: many small steps. Straight path: a few big steps land exactly.]

An Euler step follows the tangent. On a curved path the tangent leaves the path, so steps must be small; on a straight path the tangent is the path, so one step is already exact.

This is the practical heart of the method. Numerical integrators err in proportion to the curvature of the trajectory: each Euler step walks along the tangent, and curvature is precisely how fast the tangent turns. Diffusion's probability-flow trajectories are curved (the marginal velocity bends where paths cross), so they need tens of steps. Conditional straight-line paths keep the marginal flow only mildly curved, and rectified flow finishes the job: generate (noise, sample) pairs with a trained model, retrain on straight lines between those now-coupled pairs, and the paths straighten further. After a round or two, one to four steps suffice.
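The link between curvature and step count can be seen on two hand-built toy fields; neither is a trained model. Both carry the point (1, 0) to (0, 1) in unit time: `straight` has constant velocity, while `curved` rotates the point a quarter turn around the origin. The helper names are hypothetical.

```python
import numpy as np

def euler(velocity, x0, n_steps):
    """Fixed-step Euler integration of dx/dt = velocity(x, t) over t in [0, 1]."""
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity(x, k * dt)
    return x

start = np.array([1.0, 0.0])
end = np.array([0.0, 1.0])

def straight(x, t):
    """Constant velocity along the segment from start to end."""
    return end - start

def curved(x, t):
    """Quarter-turn rotation about the origin: the exact path is a circular arc."""
    return (np.pi / 2) * np.array([-x[1], x[0]])

def endpoint_error(velocity, n_steps):
    return float(np.linalg.norm(euler(velocity, start, n_steps) - end))

step_counts = (1, 4, 16, 64)
errors_straight = {n: endpoint_error(straight, n) for n in step_counts}
errors_curved = {n: endpoint_error(curved, n) for n in step_counts}
```

On the straight field a single step already lands on the endpoint. On the arc, each Euler step overshoots outward along the tangent, and the endpoint error shrinks only gradually as the step count grows.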

05 · The family tree

Where diffusion fits

Flow matching is not a rival theory; it is the same continuous-time picture entered through a different door. Choose a Gaussian probability path of a particular variance schedule and the marginal velocity field is an affine function of the score — the probability-flow ODE of a VP diffusion drops out as a special case. What changes is the parameterisation and the default geometry: predict velocity rather than noise, prefer straight conditional paths rather than variance-preserving curves, train by simulation-free regression either way.
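The "affine function of the score" statement can be made concrete for a general Gaussian path. The schedule functions α_t and σ_t below are notation assumed here for illustration; the straight-line path of section 03 corresponds to α_t = t and σ_t = 1 − t.

```latex
\begin{align*}
x_t &= \alpha_t x_1 + \sigma_t x_0, \qquad x_0 \sim \mathcal{N}(0, I) \\
\nabla_x \log p_t(x) &= -\frac{\mathbb{E}[x_0 \mid x_t = x]}{\sigma_t} \\
u_t(x) &= \dot{\alpha}_t\, \mathbb{E}[x_1 \mid x_t = x]
        + \dot{\sigma}_t\, \mathbb{E}[x_0 \mid x_t = x] \\
       &= \frac{\dot{\alpha}_t}{\alpha_t}\, x
        + \left( \frac{\dot{\alpha}_t}{\alpha_t}\, \sigma_t^2
        - \dot{\sigma}_t\, \sigma_t \right) \nabla_x \log p_t(x)
\end{align*}
```

The last step substitutes E[x₀ | x_t = x] = −σ_t ∇_x log p_t(x) and E[x₁ | x_t = x] = (x + σ_t² ∇_x log p_t(x)) / α_t. For the straight-line choice this reduces to u_t(x) = x/t + ((1 − t)/t) ∇_x log p_t(x) for t > 0, so a velocity model and a score model carry the same information.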

|                      | Diffusion (score)            | Flow matching                          |
| Learned field        | ∇_x log p_t(x)               | velocity v(x, t)                       |
| Conditional target   | −ε/σ (the added noise)       | x₁ − x₀ (a constant)                   |
| Default path shape   | curved (VP schedule)         | straight lines                         |
| Sampling cost driver | integrator order × curvature | same, but curvature is engineered down |
Mental Model