Applied ML

Gradient Clipping

A circuit breaker, not a cure

01 · First principlesOne bad step can end a run

Gradient descent's update is lr·g, and nothing in the algorithm bounds g. The loss landscape of a deep network contains cliffs — regions where curvature explodes and the local gradient is orders of magnitude larger than typical. A rare pathological batch (a corrupted document, a degenerate sequence), a loss cliff, or a transient numerical event can produce a gradient 100× the usual norm.

The asymmetry is what makes this dangerous: a thousand good steps build the model slowly, and one enormous step can throw the weights into a region from which the optimizer never recovers — the loss spikes, then plateaus at a worse level, or goes straight to NaN. Weeks of compute can hinge on the worst single batch in the corpus. Clipping exists because the cost of capping every step is tiny and the cost of not capping one step is occasionally everything.

02 · The mechanismClip the norm, keep the direction

Global-norm clipping treats the entire gradient — concatenated over all parameters — as one vector and rescales it only if its L2 norm exceeds a threshold c:

g ← g · min(1, c / ‖g‖₂)

‖g‖ ≤ c → untouched · ‖g‖ > c → same direction, length c

Two properties matter. First, below the threshold it is the identity — in a healthy run, clipping should be doing nothing most of the time. Second, above the threshold it preserves the gradient's direction exactly and shortens only its length. The step still points downhill; it is merely no longer allowed to be a leap.

03 · The contrastNorm vs value clipping

Clip by global norm

Rescales the whole vector uniformly. Direction preserved; relative magnitudes across layers preserved. One interpretable knob (c). The default for transformers and essentially all modern large-scale training. torch.nn.utils.clip_grad_norm_.

Clip by value

Clamps each coordinate into [−c, c] independently. Large coordinates saturate while small ones pass, so the update's direction changes — the step points somewhere the loss never asked for. Survives mainly in older RNN recipes and as a blunt NaN guard.

Element-wise clipping is not strictly wrong — it bounds the step too — but it distorts geometry in a way that is hard to reason about, and there is rarely a reason to prefer it when norm clipping is one line.

04 · PracticeThresholds, ordering, and the log you should keep

Threshold: max-norm ≈ 1.0 is the standard transformer default (GPT-3, LLaMA, and most things since); 0.5 to 5.0 all appear in respectable recipes. The exact value matters less than people expect, because a healthy run sits well under it.
Ordering with mixed precision: unscale the gradients before measuring the norm, or you are comparing S·‖g‖ against c and the threshold means nothing (see loss scaling). scaler.unscale_(optimizer) first, then clip, then step.
Ordering with distribution: clip after gradient synchronisation — after the allreduce in DDP, and with the sharded-aware utility under FSDP (each rank holds 1/N of the gradient, so the norm itself requires a small collective to compute).
Log the pre-clip norm every step. It is the cheapest health signal in training: a slow upward drift predicts instability before the loss shows it, and the fraction of clipped steps tells you whether clipping is a circuit breaker (rare trips) or a load-bearing wall (constant trips).

05 · The honest partClipping treats the symptom

A clipped step is a biased gradient estimate — you are no longer optimising the loss exactly, you are optimising it subject to a trust region. When clips are rare this bias is negligible and the insurance is nearly free. When clips are frequent, the run is telling you something, and the threshold is muffling the message:

If spikes are…	Likely cause	Actual fix
Tied to specific batches, reproducible	Data	Find and filter the offending examples; dedupe; cap document weirdness
Growing in frequency over the run	Learning rate / schedule	Lower peak LR, lengthen warmup; check Adam eps and β₂ at scale
Correlated with inf/nan, precision-dependent	Numerics	Audit fp16 range, reductions, attention logits (see precision tricks)
Structural — from the architecture's depth	Conditioning	Init, normalisation, residual scaling (see exploding gradients)

Rule of thumb: clipping that fires on well under ~1% of steps is insurance; clipping that fires constantly is a diagnosis you have chosen not to read. Keep the clip, but go find the cause.

Mental Model

Nothing bounds the gradient, and one unbounded step can spend the whole run's progress; clipping caps the downside for almost nothing.
Global-norm clipping shortens the step but keeps its direction; value clipping bends the direction — prefer the norm.
max-norm ≈ 1.0, applied after unscaling and after gradient sync.
The pre-clip norm is a free diagnostic; log it always.
Rare clips are insurance; frequent clips are a symptom — of data, LR, numerics, or conditioning — wearing a bandage.