Applied ML

Gradient Clipping

A circuit breaker, not a cure

01 · First principlesOne bad step can end a run

Gradient descent's update is lr·g, and nothing in the algorithm bounds g. The loss landscape of a deep network contains cliffs — regions where curvature explodes and the local gradient is orders of magnitude larger than typical. A rare pathological batch (a corrupted document, a degenerate sequence), a loss cliff, or a transient numerical event can produce a gradient 100× the usual norm.

The asymmetry is what makes this dangerous: a thousand good steps build the model slowly, and one enormous step can throw the weights into a region from which the optimizer never recovers — the loss spikes, then plateaus at a worse level, or goes straight to NaN. Weeks of compute can hinge on the worst single batch in the corpus. Clipping exists because the cost of capping every step is tiny and the cost of not capping one step is occasionally everything.

02 · The mechanismClip the norm, keep the direction

Global-norm clipping treats the entire gradient — concatenated over all parameters — as one vector and rescales it only if its L2 norm exceeds a threshold c:

g  ←  g · min(1,  c / ‖g‖2)
‖g‖ ≤ c → untouched · ‖g‖ > c → same direction, length c

Two properties matter. First, below the threshold it is the identity — in a healthy run, clipping should be doing nothing most of the time. Second, above the threshold it preserves the gradient's direction exactly and shortens only its length. The step still points downhill; it is merely no longer allowed to be a leap.

03 · The contrastNorm vs value clipping

Clip by global norm
Rescales the whole vector uniformly. Direction preserved; relative magnitudes across layers preserved. One interpretable knob (c). The default for transformers and essentially all modern large-scale training. torch.nn.utils.clip_grad_norm_.
Clip by value
Clamps each coordinate into [−c, c] independently. Large coordinates saturate while small ones pass, so the update's direction changes — the step points somewhere the loss never asked for. Survives mainly in older RNN recipes and as a blunt NaN guard.

Element-wise clipping is not strictly wrong — it bounds the step too — but it distorts geometry in a way that is hard to reason about, and there is rarely a reason to prefer it when norm clipping is one line.

04 · PracticeThresholds, ordering, and the log you should keep

05 · The honest partClipping treats the symptom

A clipped step is a biased gradient estimate — you are no longer optimising the loss exactly, you are optimising it subject to a trust region. When clips are rare this bias is negligible and the insurance is nearly free. When clips are frequent, the run is telling you something, and the threshold is muffling the message:

If spikes are…Likely causeActual fix
Tied to specific batches, reproducibleDataFind and filter the offending examples; dedupe; cap document weirdness
Growing in frequency over the runLearning rate / scheduleLower peak LR, lengthen warmup; check Adam eps and β₂ at scale
Correlated with inf/nan, precision-dependentNumericsAudit fp16 range, reductions, attention logits (see precision tricks)
Structural — from the architecture's depthConditioningInit, normalisation, residual scaling (see exploding gradients)
Rule of thumb: clipping that fires on well under ~1% of steps is insurance; clipping that fires constantly is a diagnosis you have chosen not to read. Keep the clip, but go find the cause.
Mental Model