Depth multiplies Jacobians, and products of factors ≠ 1 are exponential
The chain rule through L layers is a product of L Jacobians:
where γ is, loosely, the typical singular value of a layer's Jacobian (weight scale × activation derivative). Products amplify what sums forgive. If each factor averages 0.9, after 100 layers the signal is 0.9100 ≈ 3×10−5; if each averages 1.1, it is 1.1100 ≈ 14,000. Nothing about the network is "broken" — exponential behaviour is the default for deep composition, and γ ≈ 1 is a measure-zero target that every architectural trick below exists to widen into a basin.
On a log axis, γL is a straight line of slope log γ. Only the knife-edge γ = 1 carries signal through arbitrary depth.
Classic RNNs were the worst case: the same weight matrix is multiplied in at every timestep, so γ is locked to that one matrix's spectrum and tanh derivatives ≤ 1 push it down further. Depth-L feedforward nets at least draw L different matrices; an RNN over 500 tokens raises one matrix to the 500th power.
| Fix | What it really does | Limits |
|---|---|---|
| Init scale (Xavier, Kaiming) | Choose weight variance so each layer's Jacobian has singular values ≈ 1 at initialisation — start the product at γ = 1 | Only controls step 0; training moves the spectrum, and init cannot follow it |
| Normalisation (LayerNorm, BatchNorm) | Re-standardises activations at every layer, every step — continuously re-centering γ near 1 instead of hoping it stays there | Adds compute and (BN) batch coupling; does not by itself fix very deep stacks |
| Residual connections | h ← h + f(h) makes each Jacobian I + Jf: identity is the default path, and the gradient reaches early layers through pure addition, untouched by any multiplication | The +f branches still sum; very deep residual nets scale branches (e.g. by 1/√L) to keep variance flat |
| LSTM / GRU gates | Replace multiplication through time with a gated additive cell state — the recurrent analogue of a residual, with learned gates deciding what survives | Helps to hundreds of steps, not thousands; attention eventually sidestepped recurrence entirely |
| Clipping | Caps the explosion's damage after it happens; changes nothing about why γ > 1 | Explosions only; no help whatsoever for vanishing — you cannot clip upward |
Notice the shared shape of the real fixes: every one of them converts a multiplicative path into an additive one, or pins the multiplicative factors to 1. Residuals do it in depth, gates do it in time, normalisation does it statistically, init does it at the starting line. The transformer recipe — residual stream + LayerNorm + careful init — is all four at once, which is why 100-layer transformers train and 100-layer plain MLPs do not.