
Adam / AdamW / Adagrad

One learning rate per parameter, done carefully

01 · First principles: Why one global η is wrong

SGD applies the same learning rate to every coordinate. But gradient scales differ wildly across parameters: embedding rows for rare tokens can see gradients orders of magnitude smaller than dense early layers, and biases and weights live on different scales. One global η must then be tuned for the steepest coordinate (or it diverges), which leaves the shallow coordinates crawling.

The wish is a per-parameter step size: divide each coordinate's step by some measure of how large its gradients typically are. Every optimiser in this note is one answer to "how do you measure typical?" — and each answer breaks in a specific way that motivates the next.
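A minimal sketch of the problem, assuming a toy two-coordinate quadratic f(x) = ½ Σ aᵢ xᵢ² with curvatures 1000 and 1. The function name `sgd_quadratic` and all the numbers are illustrative choices, not taken from any library. Plain gradient descent is only stable here for η < 2/1000, and at that η the shallow coordinate barely moves.

```python
def sgd_quadratic(curvatures, lr, steps, start=1.0):
    """Gradient descent on f(x) = 0.5 * sum(a_i * x_i**2); returns the final x."""
    x = [start for _ in curvatures]
    for _ in range(steps):
        grads = [a * xi for a, xi in zip(curvatures, x)]  # df/dx_i = a_i * x_i
        x = [xi - lr * g for xi, g in zip(x, grads)]      # same lr for every coordinate
    return x


curvatures = [1000.0, 1.0]  # one steep coordinate, one shallow coordinate

# Just under the stability limit 2 / 1000: the steep coordinate converges,
# the shallow one is still most of the way from the optimum after 100 steps.
stable = sgd_quadratic(curvatures, lr=0.0019, steps=100)

# Just over the limit: the steep coordinate blows up.
unstable = sgd_quadratic(curvatures, lr=0.0021, steps=100)

print("stable:  ", stable)
print("unstable:", unstable)
```

With a step of 1/aᵢ per coordinate, both coordinates would reach the optimum in a single step, which is the per-parameter wish in its idealised form.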

02 · Adagrad: Accumulate, and die

Adagrad's measure: the running sum of squared gradients per coordinate.

G_t = G_{t−1} + g_t^2     θ_{t+1} = θ_t − η · g_t / (√G_t + ε)   (all elementwise)

Coordinates with consistently large gradients get their steps shrunk; rare-but-informative coordinates keep large steps. This genuinely works for sparse problems (it was built for them). The breakage is built into the accumulator: Gt only grows, so the effective learning rate η/√Gt decays toward zero whether or not you are done learning. On a long non-convex run, Adagrad quietly stops — learning-rate death by bookkeeping.
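To make the decay concrete, here is a small sketch that tracks the effective learning rate η/(√G + ε) for one coordinate. It assumes a constant gradient of magnitude 1, so G after t steps equals t and the rate falls like η/√t. The helper name `adagrad_effective_lr` and the values of η and ε are illustrative assumptions.

```python
import math


def adagrad_effective_lr(grads, lr=0.1, eps=1e-8):
    """Return the per-step effective learning rate lr / (sqrt(G) + eps)."""
    acc = 0.0          # running SUM of squared gradients, never decreases
    rates = []
    for g in grads:
        acc += g * g
        rates.append(lr / (math.sqrt(acc) + eps))
    return rates


# A gradient that never gets smaller still sees its step size shrink.
rates = adagrad_effective_lr([1.0] * 10_000)

for step in (1, 100, 10_000):
    print(step, rates[step - 1])
```

The gradient carries exactly as much information at step 10,000 as at step 1, yet the step taken is a hundred times smaller.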

03 · RMSProp → Adam: Forget the distant past

RMSProp's fix is one character deep: replace the sum with an exponential moving average, so old gradients fade and the denominator tracks the recent gradient scale instead of all of history.

v_t = β_2 v_{t−1} + (1−β_2) g_t^2
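A quick comparison of the two denominators, as a sketch under assumed numbers: the gradient magnitude is 10 for 100 steps and then drops to 0.1 for 2000 steps. The decay rate β₂ = 0.99 is chosen here only to keep the run short, and the function name `denominators` is hypothetical.

```python
import math


def denominators(grads, beta2=0.99):
    """Return (sqrt of Adagrad-style sum, sqrt of RMSProp-style EMA) after all grads."""
    total = 0.0  # Adagrad: sum of squared gradients
    ema = 0.0    # RMSProp: exponential moving average of squared gradients
    for g in grads:
        total += g * g
        ema = beta2 * ema + (1.0 - beta2) * g * g
    return math.sqrt(total), math.sqrt(ema)


# Early phase with large gradients, then a long phase with small ones.
grads = [10.0] * 100 + [0.1] * 2000
sum_denom, ema_denom = denominators(grads)

print("sqrt(sum):", sum_denom)  # still dominated by the early large gradients
print("sqrt(EMA):", ema_denom)  # has forgotten them and tracks the recent scale
```

The sum-based denominator stays pinned by the first 100 steps, while the EMA settles at the current gradient magnitude.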

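Adam keeps that denominator and adds an EMA of the gradient itself (momentum, the first moment), plus a bias correction. Both EMAs start at zero, so early on they are biased toward zero. For v_t that bias would shrink the denominator and inflate the first steps: with β_2 = 0.999, v_1 = 0.001 · g_1^2, so √v_1 is about 32 times smaller than |g_1|. Dividing by (1 − β^t), with β raised to the power t, removes the bias; the correction is exact when the gradient's distribution is constant over time: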

m_t = β_1 m_{t−1} + (1−β_1) g_t     v_t = β_2 v_{t−1} + (1−β_2) g_t^2
m̂_t = m_t / (1 − β_1^t)     v̂_t = v_t / (1 − β_2^t)
θ_{t+1} = θ_t − η · m̂_t / (√v̂_t + ε)
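The three lines above translate almost directly into code. What follows is a minimal pure-Python sketch, not a reference implementation: the function name `adam_step` is hypothetical, and the defaults (η = 1e-3, β₁ = 0.9, β₂ = 0.999, ε = 1e-8) are the commonly quoted ones, assumed here. It takes one step on two coordinates whose gradients differ by a factor of 10,000.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One elementwise Adam update; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for th, g, mi, vi in zip(theta, grad, m, v):
        mi = beta1 * mi + (1.0 - beta1) * g        # first-moment EMA
        vi = beta2 * vi + (1.0 - beta2) * g * g    # second-moment EMA
        m_hat = mi / (1.0 - beta1 ** t)            # bias correction
        v_hat = vi / (1.0 - beta2 ** t)
        new_theta.append(th - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v


theta, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
grad = [100.0, 0.01]  # very different gradient scales

theta, m, v = adam_step(theta, grad, m, v, t=1)
print("theta after one step:", theta)

# What the first coordinate's ratio would be WITHOUT bias correction.
uncorrected_ratio = m[0] / math.sqrt(v[0])
print("uncorrected m / sqrt(v):", uncorrected_ratio)
```

Both coordinates move by about η on the first step despite the gap in gradient size. The uncorrected ratio shows what the correction is for: with these defaults the raw EMAs would make the first step roughly three times larger than η.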

Read the update as a unit: the step in each coordinate is roughly η times the signal-to-noise ratio of its gradient, capped near ±η. Adam steps are scale-free: multiply the loss by 1000 and the update does not change, as long as √v̂ stays well above ε. That robustness is why it became the default; the tradeoff is that the adaptive denominator is itself a crude diagonal curvature guess (see second-order methods), and it can be wrong in ways SGD cannot.
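The scale-free claim is easy to check numerically. The sketch below feeds the same gradient sequence to a scalar Adam at three scales: as is, multiplied by 1000, and multiplied by 1e-8. The function name `adam_updates`, the gradient values, and the hyperparameters are all illustrative assumptions.

```python
import math


def adam_updates(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the sequence of Adam parameter updates for a scalar gradient stream."""
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        updates.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates


grads = [0.5, -0.2, 0.9, 0.4, -0.1]

base = adam_updates(grads)
scaled = adam_updates([1000.0 * g for g in grads])  # loss multiplied by 1000
tiny = adam_updates([1e-8 * g for g in grads])      # gradients comparable to eps

max_gap = max(abs(a - b) for a, b in zip(base, scaled))
print("largest difference, base vs x1000:", max_gap)
print("first update, base vs tiny:", base[0], tiny[0])
```

Scaling up changes nothing visible. Scaling down until the gradients are the size of ε does: the ε in the denominator starts to dominate and the updates shrink, which is the one place the scale-free reading stops holding.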

04 · AdamW: L2 in the update is not weight decay

Classically, "weight decay" and "L2 penalty" were synonyms: adding λ‖θ‖²/2 to the loss adds λθ to the gradient, which under plain SGD shrinks every weight by the same factor each step. Under Adam that equivalence silently breaks: the λθ term gets fed through the division by √v̂, so weights with large gradient history get almost no decay, and weights with small gradients get heavy decay. The regulariser's strength becomes a function of the optimiser's internal state — not what anyone asked for.

AdamW decouples it: keep λθ out of m and v entirely, and apply the shrinkage directly to the weights.

θ_{t+1} = θ_t − η · m̂_t / (√v̂_t + ε) − η λ θ_t

Same intent, restored meaning, and in practice better generalisation — which is why transformer training defaults to AdamW rather than Adam + L2 (the regularisation side of this story lives in regularisation).
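The difference can be seen on a toy problem. This sketch assumes a single weight starting at 1.0 whose loss gradient just alternates between +s and −s (pure noise, zero mean), so any systematic shrinkage comes from the regulariser alone. The function name `train`, the values η = 0.01 and λ = 0.1, and the two noise scales are illustrative assumptions.

```python
import math


def train(grad_scale, decoupled, steps=1000, lr=0.01, wd=0.1,
          beta1=0.9, beta2=0.999, eps=1e-8):
    """Train one weight with Adam + L2 (decoupled=False) or AdamW (decoupled=True)."""
    theta, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        # Zero-mean loss gradient: +s on odd steps, -s on even steps.
        g = grad_scale if t % 2 == 1 else -grad_scale
        if not decoupled:
            g = g + wd * theta              # L2: penalty gradient goes through m and v
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        step = lr * m_hat / (math.sqrt(v_hat) + eps)
        if decoupled:
            step += lr * wd * theta         # AdamW: shrink the weight directly
        theta -= step
    return theta


l2_big = train(10.0, decoupled=False)     # large gradient history
l2_small = train(0.1, decoupled=False)    # small gradient history
adamw_big = train(10.0, decoupled=True)
adamw_small = train(0.1, decoupled=True)

print("Adam + L2:", l2_big, l2_small)
print("AdamW:    ", adamw_big, adamw_small)
```

Under Adam + L2 the weight with the noisy, large gradients keeps most of its value while the weight with small gradients is driven close to zero, even though both carry the same λ. Under AdamW the two weights end up at nearly the same value, because the shrinkage no longer passes through the denominator.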

05 · Side by side: The updates

| Optimiser | Update (elementwise) | How it fails |
| --- | --- | --- |
| SGD | θ ← θ − η g | one η for all scales |
| Adagrad | θ ← θ − η g / √(Σ g²) | accumulator only grows → LR death |
| RMSProp | θ ← θ − η g / √EMA[g²] | no momentum, biased early steps |
| Adam | θ ← θ − η m̂ / √v̂ | L2 penalty mangled by the denominator |
| AdamW | θ ← θ − η m̂ / √v̂ − ηλθ | current default for transformers |

Each row exists because the row above broke. The whole family is one idea (normalise steps by recent gradient scale) debugged four times.
Mental Model