General ML

Adam / AdamW / Adagrad

One learning rate per parameter, done carefully

01 · First principles
Why one global η is wrong

SGD applies the same learning rate to every coordinate, but gradient scales differ wildly across parameters: embedding rows for rare tokens can see gradients orders of magnitude smaller than dense early layers, and biases and weights live on different scales. One global η must then be tuned for the steepest coordinate (or training diverges), which leaves the shallow coordinates crawling.

The wish is a per-parameter step size: divide each coordinate's step by some measure of how large its gradients typically are. Every optimiser in this note is one answer to "how do you measure typical?" — and each answer breaks in a specific way that motivates the next.
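The tension is easy to reproduce on a toy problem. The sketch below assumes a hypothetical two-coordinate quadratic loss with curvatures 100 and 0.1 (the names `curvature` and `sgd_run` are illustrative, not from any library) and runs plain gradient descent with a single η.

```python
import numpy as np

# Hypothetical toy loss: f(θ) = 0.5 * Σ c_i θ_i², with very different curvatures.
curvature = np.array([100.0, 0.1])  # steep coordinate, shallow coordinate


def sgd_run(lr, steps=100):
    """Plain gradient descent with one learning rate for every coordinate."""
    theta = np.array([1.0, 1.0])
    for _ in range(steps):
        grad = curvature * theta   # gradient of the quadratic above
        theta = theta - lr * grad  # same lr applied to both coordinates
    return theta


safe = sgd_run(lr=0.01)    # stable: the steep coordinate needs lr < 2/100
unsafe = sgd_run(lr=0.03)  # too large for the steep coordinate, so it diverges
```

With η = 0.01 the steep coordinate converges immediately, while the shallow one has covered under 10% of the distance to the optimum after 100 steps. Raising η to 0.03 to speed up the shallow coordinate makes the steep one blow up.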

02 · Adagrad
Accumulate, and die

Adagrad's measure: the running sum of squared gradients per coordinate.

G_t = G_{t−1} + g_t²
θ_{t+1} = θ_t − η · g_t / (√G_t + ε)
(all operations elementwise)

Coordinates with consistently large gradients get their steps shrunk; rare-but-informative coordinates keep large steps. This genuinely works for sparse problems (it was built for them). The breakage is built into the accumulator: G_t only grows, so the effective learning rate η/√G_t decays toward zero whether or not you are done learning. On a long non-convex run, Adagrad quietly stops — learning-rate death by bookkeeping.
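A minimal NumPy sketch of the Adagrad update makes the decay visible. It assumes a constant unit gradient, which is the simplest case in which the effective learning rate can be read off as η/√t; `adagrad_step` and `effective_lr` are illustrative names.

```python
import numpy as np


def adagrad_step(theta, grad, G, lr=0.1, eps=1e-8):
    """One elementwise Adagrad step; G is the running sum of squared gradients."""
    G = G + grad ** 2
    theta = theta - lr * grad / (np.sqrt(G) + eps)
    return theta, G


theta = np.zeros(1)
G = np.zeros(1)
effective_lr = []
for t in range(1, 10001):
    grad = np.ones(1)  # assumed constant unit gradient, for illustration only
    theta, G = adagrad_step(theta, grad, G)
    # Step size actually applied per unit of gradient at this point in the run.
    effective_lr.append(0.1 / (np.sqrt(G[0]) + 1e-8))
```

Under this assumption the effective rate falls from 0.1 at the first step to 0.001 after 10,000 steps, and nothing in the update can ever raise it again.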

03 · RMSProp → Adam
Forget the distant past

RMSProp's fix is one character deep: replace the sum with an exponential moving average, so old gradients fade and the denominator tracks the recent gradient scale instead of all of history.

v_t = β₂ v_{t−1} + (1−β₂) g_t²
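Unrolling the recursion makes the forgetting explicit. A short derivation, assuming the average is initialised at v_0 = 0:

```latex
\begin{aligned}
v_t &= \beta_2\, v_{t-1} + (1-\beta_2)\, g_t^2 \\
    &= (1-\beta_2) \sum_{i=1}^{t} \beta_2^{\,t-i}\, g_i^2 \qquad (v_0 = 0), \\
\sum_{i=1}^{t} (1-\beta_2)\, \beta_2^{\,t-i} &= 1 - \beta_2^{\,t}.
\end{aligned}
```

A gradient that is k steps old carries weight (1−β₂)β₂^k, so the average effectively spans about 1/(1−β₂) steps (roughly 1000 for β₂ = 0.999). The weights sum to 1−β₂^t rather than 1, so early in training the average underestimates the true gradient scale.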

Adam keeps that denominator and adds an EMA of the gradient itself (momentum, the first moment), plus one honest correction. Both EMAs start at zero, so early on they are biased toward zero; for v_t that shrinks the denominator (by a factor of √(1−β₂) at the first step) and inflates the early steps. Dividing by (1−β^t) removes the bias, exactly so when the gradient statistics are stationary:

m_t = β₁ m_{t−1} + (1−β₁) g_t
v_t = β₂ v_{t−1} + (1−β₂) g_t²
m̂_t = m_t / (1−β₁^t)
v̂_t = v_t / (1−β₂^t)
θ_{t+1} = θ_t − η · m̂_t / (√v̂_t + ε)
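The five lines above translate almost directly into NumPy. This is a sketch rather than a reference implementation: the function name `adam_step` is illustrative, and the defaults (η = 1e-3, β₁ = 0.9, β₂ = 0.999, ε = 1e-8) are the commonly used values, assumed here.

```python
import numpy as np


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One elementwise Adam step; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad       # EMA of gradients (first moment)
    v = beta2 * v + (1 - beta2) * grad ** 2  # EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


theta0 = np.array([1.0, -2.0])
grad = np.array([0.5, -4.0])
theta1, m1, v1 = adam_step(theta0, grad, np.zeros(2), np.zeros(2), t=1)

first_step = theta1 - theta0                # about -lr * sign(grad)
raw_denominator = np.sqrt(v1)               # uncorrected sqrt(v) after one step
shrink = raw_denominator / np.abs(grad)     # sqrt(1 - beta2), about 0.032
```

At t = 1 the corrected moments reduce to m̂ = g and v̂ = g², so the first step is about −η·sign(g) whatever the gradient's magnitude. The uncorrected √v, by contrast, is only about 3% of |g|, which is the bias the correction removes.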

Read the update as a unit: the step in each coordinate is roughly η times the signal-to-noise ratio of its gradient (mean over root-mean-square), so its magnitude is capped near η. Adam steps are scale-free: multiply the loss by 1000 and, apart from the small ε term, the update does not change. That robustness is why it became the default; the tradeoff is that the adaptive denominator is itself a crude diagonal curvature guess (see second-order methods), and it can be wrong in ways SGD cannot.
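The scale-invariance claim can be checked numerically. The sketch below feeds the same hypothetical gradient sequence (a deterministic cosine pattern, chosen only for reproducibility) through Adam twice, once as is and once multiplied by 1000, and compares the updates. `adam_updates` is an illustrative helper, not a library function.

```python
import numpy as np


def adam_updates(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the Adam update (θ_{t+1} − θ_t) at every step of a gradient sequence."""
    m = np.zeros_like(grads[0])
    v = np.zeros_like(grads[0])
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        updates.append(-lr * m_hat / (np.sqrt(v_hat) + eps))
    return np.array(updates)


# Hypothetical gradient sequence: 50 steps, 3 coordinates.
grads = np.cos(np.arange(150.0)).reshape(50, 3)

base = adam_updates(grads)
scaled = adam_updates(1000.0 * grads)  # same run with the loss multiplied by 1000
adam_max_diff = np.max(np.abs(base - scaled))

# Plain SGD for contrast: the update -lr * g scales directly with the loss.
sgd_base = -1e-3 * grads
sgd_scaled = -1e-3 * (1000.0 * grads)
sgd_max_diff = np.max(np.abs(sgd_base - sgd_scaled))
```

The two Adam runs differ only through ε, so `adam_max_diff` is negligible, while the SGD updates differ by the full factor of 1000.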

04 · AdamW
L2 in the update is not weight decay

Classically, "weight decay" and "L2 penalty" were synonyms: adding λ‖θ‖²/2 to the loss adds λθ to the gradient, which under plain SGD shrinks every weight by the same factor (1−ηλ) each step. Under Adam that equivalence silently breaks: the λθ term gets fed through the division by √v̂, so weights with large gradient history get almost no decay, and weights with small gradients get heavy decay. The regulariser's strength becomes a function of the optimiser's internal state, which is not what anyone asked for.
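A back-of-the-envelope calculation shows how uneven the decay becomes. It rests on a simplifying assumption that is not from the text: the loss gradients are zero-mean noise, so that after warm-up m̂ ≈ λθ and v̂ ≈ (gradient RMS)² + (λθ)². The gradient scales and λ below are hypothetical.

```python
import numpy as np

lr, lam, eps = 1e-3, 0.1, 1e-8

theta = np.array([1.0, 1.0])       # two weights with the same value
grad_rms = np.array([10.0, 0.01])  # hypothetical loss-gradient scale per weight

# Assumption: zero-mean loss gradients, so only the λθ term survives in m_hat,
# while v_hat sees both the noise power and the λθ term.
m_hat = lam * theta
v_hat = grad_rms ** 2 + (lam * theta) ** 2

# Shrinkage per step when λθ is routed through Adam's division.
l2_shrink = lr * m_hat / (np.sqrt(v_hat) + eps)

# Shrinkage per step when the decay is applied directly to the weights.
decoupled_shrink = lr * lam * theta

ratio = l2_shrink[1] / l2_shrink[0]
```

Under these assumptions the small-gradient weight is decayed roughly 100 times harder than the large-gradient one, although both have the same value and the same λ. Applying the decay directly to the weights shrinks both by the same amount.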

AdamW decouples it: keep λθ out of m and v entirely, and apply the shrinkage directly to the weights.

θ_{t+1} = θ_t − η · m̂_t / (√v̂_t + ε) − η λ θ_t

Same intent, restored meaning, and in practice better generalisation — which is why transformer training defaults to AdamW rather than Adam + L2 (the regularisation side of this story lives in regularisation).
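A sketch of the decoupled update in NumPy, following the formula above term by term. The name `adamw_step` and the default values are assumptions for illustration; library implementations may order the operations slightly differently.

```python
import numpy as np


def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One elementwise AdamW step; grad is the loss gradient only (no λθ term)."""
    m = beta1 * m + (1 - beta1) * grad       # the moments never see the decay
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    adam_part = lr * m_hat / (np.sqrt(v_hat) + eps)
    decay_part = lr * weight_decay * theta   # shrinkage applied straight to θ
    return theta - adam_part - decay_part, m, v


# With a zero loss gradient, only the decay acts: every weight shrinks by (1 - lr·λ).
theta0 = np.array([2.0, -3.0])
theta1, m1, v1 = adamw_step(theta0, np.zeros(2), np.zeros(2), np.zeros(2),
                            t=1, lr=0.1, weight_decay=0.5)
```

In the zero-gradient example both weights shrink by the same factor, 1 − 0.1·0.5 = 0.95, and the moment estimates stay at zero, which is the decoupling in miniature.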

05 · Side by side
The updates

Optimiser	Update (elementwise, ε omitted)	Fails how / status
SGD	`θ ← θ − η g`	one η for all scales
Adagrad	`θ ← θ − η g / √(Σ g²)`	accumulator only grows → LR death
RMSProp	`θ ← θ − η g / √EMA[g²]`	no momentum, biased early steps
Adam	`θ ← θ − η m̂ / √v̂`	L2 penalty mangled by the denominator
AdamW	`θ ← θ − η m̂ / √v̂ − ηλθ`	current default for transformers

Each row exists because the row above broke. The whole family is one idea — normalise steps by recent gradient scale — debugged four times.

Mental Model

The problem: gradient magnitudes differ per parameter, so one global η under-serves most coordinates.
Adagrad divides by total gradient history; the history only grows, so learning eventually stops.
RMSProp swaps the sum for an EMA; Adam adds momentum and bias correction on top.
Adam's step ≈ per-coordinate signal-to-noise ratio, capped near η; it is invariant to loss scale.
AdamW: adaptive optimisers mangle L2-in-the-gradient, so apply weight decay directly to the weights.