General ML

Backpropagation

All the gradients for the price of one forward pass

01 · First principlesThe problem that forces it to exist

Gradient descent needs ∂L/∂θ for every parameter, every step. The naive way to get a derivative is finite differences: nudge θj, rerun the network, see how L moved.

∂L/∂θj ≈ ( L(θ + εej) − L(θ) ) / ε   →   one forward pass per parameter

With 109 parameters that is a billion forward passes for a single optimisation step. Not slow — impossible. The question backprop answers: can we get all the partial derivatives for roughly the cost of one pass? Yes, because the derivatives massively share structure.

02 · The ideaChain rule + dynamic programming

A network is a composition: x → h1 → h2 → … → hn → L. The chain rule says the gradient with respect to anything early is a product of local derivatives along the path. The key observation is about reuse: every parameter in layer k needs the same downstream factor ∂L/∂hk. Compute that quantity once, and all of layer k's parameter gradients are one cheap local step away.

So we sweep backwards, defining δk = ∂L/∂hk and reusing it:

δn = ∂L/∂hn
δk = δk+1 · ∂hk+1/∂hk   (reuse, never recompute)
∂L/∂θk = δk · ∂hk/∂θk

This is textbook dynamic programming: the naive method recomputes the same downstream product over and over; backprop memoises it. One backward sweep visits each edge of the computation graph once.

x h₁ h₂ L FORWARD · compute and cache activations BACKWARD · δₖ = δₖ₊₁ · local Jacobian, one VJP per edge cached activations feed the local derivatives (the memory bill)

One forward sweep stores activations; one backward sweep reuses each δ to price every parameter.

03 · The mechanicsVector–Jacobian products everywhere

Each layer's local derivative ∂hk+1/∂hk is a Jacobian, and for a hidden width of 4096 that matrix has ~16M entries per example. Backprop never builds it. What the backward sweep actually needs is only the product of a row vector with that Jacobian:

δk = δk+1 Jk   —  a vector–Jacobian product, computed directly

For a linear layer hk+1 = W hk, the VJP is just δk+1W — a matrix–vector multiply, same shape of work as the forward pass. For an elementwise activation it is an elementwise multiply. Every primitive in an autodiff framework ships its VJP rule; "backprop" is the composition of these rules in reverse topological order. (Forward-mode autodiff composes Jacobian–vector products instead — efficient for few inputs and many outputs, which is the opposite of training, where we have one scalar loss and millions of inputs.)

Cost: each VJP costs about as much as the layer's forward computation, so the whole backward pass ≈ one extra forward pass. Total gradient cost is ~2–3× a forward run, independent of parameter count. That constant is the reason deep learning is feasible.

04 · The priceThe memory bill

The local derivatives are functions of the forward activations (the VJP through W needs hk; the VJP through ReLU needs the sign of its input). So the forward pass must cache every intermediate activation until the backward pass consumes it. Training memory is dominated not by the weights but by activations × batch size × depth.

The standard escape is gradient checkpointing: store activations only every few layers, recompute the rest during the backward pass. It trades roughly one extra forward pass of compute for an O(√depth) memory footprint — the same compute-for-memory bargain backprop itself declined to make.

05 · PerspectiveWhat backprop is not

It is not a learning rule and not specific to neural networks; it is an algorithm for computing exact derivatives of any composed differentiable program. The learning happens in the optimiser that consumes the gradients (SGD, Adam). And it propagates whatever the loss provides — which is why the shape of the loss gradient and the saturation of activations matter: backprop multiplies local derivatives, and long products of numbers below one vanish.

Mental Model