One learning rate per parameter, done carefully
SGD applies the same learning rate to every coordinate. But gradient scales differ wildly across parameters: embedding rows for rare tokens see gradients a thousand times smaller than dense early layers; biases and weights live on different scales. One global η must then be tuned for the steepest coordinate (or it diverges), which leaves the shallow coordinates crawling.
The wish is a per-parameter step size: divide each coordinate's step by some measure of how large its gradients typically are. Every optimiser in this note is one answer to "how do you measure typical?" — and each answer breaks in a specific way that motivates the next.
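The single-η problem can be made concrete with a toy example. The sketch below is an illustration only: the two-coordinate quadratic, its curvatures (1000 and 1), and the function name `sgd_quadratic` are assumptions chosen for the demo, not anything from a real training run.

```python
def sgd_quadratic(eta, steps=100, curvatures=(1000.0, 1.0)):
    """Plain SGD on f(theta) = 0.5 * sum(c_i * theta_i**2), starting from all ones."""
    theta = [1.0 for _ in curvatures]
    for _ in range(steps):
        # Gradient of the quadratic is c_i * theta_i in each coordinate.
        grads = [c * t for c, t in zip(curvatures, theta)]
        theta = [t - eta * g for t, g in zip(theta, grads)]
    return theta

# eta just below the stability limit 2 / 1000 set by the steep coordinate.
stable = sgd_quadratic(eta=0.0019)
# eta just above that limit: the steep coordinate blows up.
unstable = sgd_quadratic(eta=0.0021)

print("stable:  ", stable)
print("unstable:", unstable)
```

With the stable η the steep coordinate is essentially at the minimum after 100 steps, while the shallow coordinate has barely moved from its starting value; nudging η slightly higher makes the steep coordinate diverge.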
Adagrad's measure: the running sum of squared gradients per coordinate, G_t = Σ g² over steps 1…t, which gives the update θ ← θ − η g / √G_t.
Coordinates with consistently large gradients get their steps shrunk; rare-but-informative coordinates keep large steps. This genuinely works for sparse problems (it was built for them). The breakage is built into the accumulator: G_t only grows, so the effective learning rate η/√G_t decays toward zero whether or not you are done learning. On a long non-convex run, Adagrad quietly stops: learning-rate death by bookkeeping.
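The decay of the effective learning rate is easy to see for a single coordinate. This is a minimal sketch, assuming a constant gradient of 1.0 and a small stabiliser `eps` in the denominator (a common convention that the text above does not mention); the function name is hypothetical.

```python
import math

def adagrad_effective_lr(grads, eta=0.1, eps=1e-8):
    """Return eta / sqrt(G_t) after each gradient, for one coordinate."""
    G = 0.0
    history = []
    for g in grads:
        G += g * g  # the accumulator can only grow
        history.append(eta / (math.sqrt(G) + eps))
    return history

# Constant gradient of 1.0, so G_t = t and the effective rate is eta / sqrt(t).
lrs = adagrad_effective_lr([1.0] * 10_000)
print(lrs[0], lrs[99], lrs[-1])
```

After 10,000 steps the effective rate has fallen by a factor of 100, even though the gradient never changed and nothing indicates the problem is solved.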
RMSProp's fix is one character deep: replace the sum with an exponential moving average, v_t = β v_{t−1} + (1 − β) g², so old gradients fade and the denominator tracks the recent gradient scale instead of all of history.
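To compare the two denominators side by side, the sketch below feeds the same gradient stream to both accumulators. The stream (1000 large gradients followed by 1000 small ones), the decay rate `beta=0.9`, and the function name are assumptions picked for illustration.

```python
import math

def denominators(grads, beta=0.9):
    """Return (sqrt of Adagrad sum, sqrt of RMSProp EMA) after the whole stream."""
    G, v = 0.0, 0.0
    for g in grads:
        G += g * g                            # Adagrad: remembers everything
        v = beta * v + (1.0 - beta) * g * g   # RMSProp: old terms fade geometrically
    return math.sqrt(G), math.sqrt(v)

# Gradient scale drops from 10 to 0.1 halfway through.
grads = [10.0] * 1000 + [0.1] * 1000
adagrad_denom, rmsprop_denom = denominators(grads)
print(adagrad_denom, rmsprop_denom)
```

The Adagrad denominator is still dominated by the early large gradients, while the RMSProp denominator has settled near the current gradient magnitude of 0.1.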
Adam keeps that denominator and adds an EMA of the gradient itself (momentum, the first moment), plus one honest correction. Both EMAs start at zero, so early on they are biased toward zero; for v_t that would make the denominator tiny and the first steps enormous. Dividing by (1 − β^t) removes the bias (exactly so when the gradient distribution is stationary): m̂_t = m_t / (1 − β₁^t) and v̂_t = v_t / (1 − β₂^t).
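As a minimal sketch of that update for one coordinate, the function below implements a single Adam step with the bias correction switchable. The hyperparameter defaults (η = 1e-3, β₁ = 0.9, β₂ = 0.999, ε = 1e-8) are the commonly used ones and are assumptions here, as are the function name and the `bias_correction` flag.

```python
import math

def adam_step(theta, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, bias_correction=True):
    """One Adam step for a single coordinate; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g        # EMA of the gradient
    v = beta2 * v + (1 - beta2) * g * g    # EMA of the squared gradient
    if bias_correction:
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
    else:
        m_hat, v_hat = m, v
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# First step from zero state with gradient 0.5.
corrected, _, _ = adam_step(0.0, 0.5, 0.0, 0.0, t=1)
uncorrected, _, _ = adam_step(0.0, 0.5, 0.0, 0.0, t=1, bias_correction=False)
print(corrected, uncorrected)
```

With the correction the first step has magnitude very close to η. Without it, both EMAs are shrunk toward zero, but the square root in the denominator shrinks it more than the numerator, so with these defaults the first step comes out roughly three times larger.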
Read the update as a unit: the step in each coordinate is roughly the signal-to-noise ratio of its gradient, capped near ±η. Adam steps are scale-free: multiply the loss by 1000 and the update does not change (up to the small ε usually added to the denominator for numerical stability). That robustness is why it became the default; the tradeoff is that the adaptive denominator is itself a crude diagonal curvature guess (see second-order methods), and it can be wrong in ways SGD cannot.
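The scale-free claim can be checked directly. This sketch runs the same Adam recursion on a short gradient sequence and on that sequence multiplied by 1000; the gradient values, the default hyperparameters, and the function name are all illustrative assumptions.

```python
import math

def adam_updates(grads, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the per-step Adam updates (delta theta) for one coordinate."""
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        updates.append(-eta * m_hat / (math.sqrt(v_hat) + eps))
    return updates

grads = [0.3, -1.2, 0.7, 0.05, -0.4]
base = adam_updates(grads)
scaled = adam_updates([1000.0 * g for g in grads])  # same loss, multiplied by 1000

# Largest difference between the two update sequences.
max_gap = max(abs(a - b) for a, b in zip(base, scaled))
print(max_gap)
```

The two update sequences agree to within the tiny perturbation introduced by `eps`; the same rescaling under plain SGD would multiply every step by 1000.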
Classically, "weight decay" and "L2 penalty" were synonyms: adding λ‖θ‖²/2 to the loss adds λθ to the gradient, which under plain SGD shrinks every weight by the same factor each step. Under Adam that equivalence silently breaks: the λθ term gets fed through the division by √v̂, so weights with large gradient history get almost no decay, and weights with small gradients get heavy decay. The regulariser's strength becomes a function of the optimiser's internal state — not what anyone asked for.
AdamW decouples it: keep λθ out of m and v entirely, and apply the shrinkage directly to the weights.
Same intent, restored meaning, and in practice better generalisation — which is why transformer training defaults to AdamW rather than Adam + L2 (the regularisation side of this story lives in regularisation).
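The difference between the two schemes shows up in a small simulation. The sketch below is a toy under stated assumptions: a single weight starting at 1.0, a zero-mean alternating stand-in for the loss gradient (+s, −s, ...) so that any net shrinkage comes from the regulariser, and illustrative values η = 1e-3 and λ = 0.1. The function name `run` and its `decoupled` flag are hypothetical.

```python
import math

def run(decoupled, noise_scale, steps=1000, eta=1e-3, lam=0.1,
        beta1=0.9, beta2=0.999, eps=1e-8):
    """Train one weight from 1.0 with Adam + L2 (decoupled=False) or AdamW."""
    theta, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        # Zero-mean stand-in for the loss gradient: +s, -s, +s, ...
        g = noise_scale if t % 2 == 1 else -noise_scale
        if not decoupled:
            g += lam * theta               # Adam + L2: penalty enters the gradient
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        step = eta * m_hat / (math.sqrt(v_hat) + eps)
        if decoupled:
            step += eta * lam * theta      # AdamW: shrinkage bypasses m and v
        theta -= step
    return theta

l2_large = run(decoupled=False, noise_scale=10.0)     # large gradient history
l2_small = run(decoupled=False, noise_scale=0.01)     # small gradient history
adamw_large = run(decoupled=True, noise_scale=10.0)
adamw_small = run(decoupled=True, noise_scale=0.01)
print(l2_large, l2_small, adamw_large, adamw_small)
```

Under Adam + L2 the weight with large gradients is barely regularised while the weight with small gradients is shrunk heavily; under AdamW both weights end at nearly the same value, because the shrinkage no longer depends on the gradient scale.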
| Optimiser | Update (elementwise) | Fails how |
|---|---|---|
| SGD | θ ← θ − η g | one η for all scales |
| Adagrad | θ ← θ − η g / √(Σ g²) | accumulator only grows → LR death |
| RMSProp | θ ← θ − η g / √EMA[g²] | no momentum, biased early steps |
| Adam | θ ← θ − η m̂ / √v̂ | L2 penalty mangled by the denominator |
| AdamW | θ ← θ − η m̂ / √v̂ − ηλθ | current default for transformers |