Applied ML

Exploding / Vanishing Gradients

Depth multiplies Jacobians, and products of factors ≠ 1 are exponential

01 · First principlesBackprop is a product, not a sum

The chain rule through L layers is a product of L Jacobians:

∂loss/∂h0  =  JL · JL−1 · … · J1   ⟹   ‖gradient‖  ~  γL

where γ is, loosely, the typical singular value of a layer's Jacobian (weight scale × activation derivative). Products amplify what sums forgive. If each factor averages 0.9, after 100 layers the signal is 0.9100 ≈ 3×10−5; if each averages 1.1, it is 1.1100 ≈ 14,000. Nothing about the network is "broken" — exponential behaviour is the default for deep composition, and γ ≈ 1 is a measure-zero target that every architectural trick below exists to widen into a basin.

DEPTH L → ‖GRADIENT‖ (LOG SCALE) → γ = 1.05 — explodes γ = 1.00 — survives γ = 0.95 — vanishes γ^L FOR γ = 0.95 / 1.00 / 1.05 — STRAIGHT LINES ON A LOG AXIS, ±5% COMPOUNDS TO ORDERS OF MAGNITUDE

On a log axis, γL is a straight line of slope log γ. Only the knife-edge γ = 1 carries signal through arbitrary depth.

02 · SymptomsWhat each failure looks like

Vanishing
Loss plateaus early or learns only "shallow" structure; gradient norms per layer fall by orders of magnitude from output to input (plot this — it is one line of code); early layers' weights barely move from init. The run looks calm and is quietly dead at the bottom.
Exploding
Loss spikes or NaNs; gradient norm jumps orders of magnitude on some steps; with fp16, infs arrive first (see mixed precision). Loud, fast, and — unlike vanishing — at least easy to notice.

Classic RNNs were the worst case: the same weight matrix is multiplied in at every timestep, so γ is locked to that one matrix's spectrum and tanh derivatives ≤ 1 push it down further. Depth-L feedforward nets at least draw L different matrices; an RNN over 500 tokens raises one matrix to the 500th power.

03 · The fixesWhat each one really does to γL

FixWhat it really doesLimits
Init scale (Xavier, Kaiming) Choose weight variance so each layer's Jacobian has singular values ≈ 1 at initialisation — start the product at γ = 1 Only controls step 0; training moves the spectrum, and init cannot follow it
Normalisation (LayerNorm, BatchNorm) Re-standardises activations at every layer, every step — continuously re-centering γ near 1 instead of hoping it stays there Adds compute and (BN) batch coupling; does not by itself fix very deep stacks
Residual connections h ← h + f(h) makes each Jacobian I + Jf: identity is the default path, and the gradient reaches early layers through pure addition, untouched by any multiplication The +f branches still sum; very deep residual nets scale branches (e.g. by 1/√L) to keep variance flat
LSTM / GRU gates Replace multiplication through time with a gated additive cell state — the recurrent analogue of a residual, with learned gates deciding what survives Helps to hundreds of steps, not thousands; attention eventually sidestepped recurrence entirely
Clipping Caps the explosion's damage after it happens; changes nothing about why γ > 1 Explosions only; no help whatsoever for vanishing — you cannot clip upward

Notice the shared shape of the real fixes: every one of them converts a multiplicative path into an additive one, or pins the multiplicative factors to 1. Residuals do it in depth, gates do it in time, normalisation does it statistically, init does it at the starting line. The transformer recipe — residual stream + LayerNorm + careful init — is all four at once, which is why 100-layer transformers train and 100-layer plain MLPs do not.

04 · DiagnosisThe per-layer gradient norm plot

  1. Log ‖grad‖ per layer (or per block) once every few hundred steps; plot layer index against log norm.
  2. Healthy: roughly flat across depth, drifting slowly over training.
  3. Vanishing: a straight descending line on the log plot — you are literally seeing the slope log γ from the figure above. Suspect a missing or misplaced norm, a residual branch that is not one, or bad init in a custom block.
  4. Exploding: norms fine, then spiking on rare steps → that is a clipping/data/LR question, not an architecture question. Norms growing steadily with depth → architecture or init.
Worth internalising: in a modern transformer stack the problem is mostly solved by construction, so when it reappears the cause is nearly always a deviation — a custom layer outside the residual stream, a norm removed "because it seemed redundant", an init copied from a differently-shaped layer.
Mental Model