Curvature without paying for the Hessian
Newton's method sets the gold standard — precondition the gradient by inverse curvature, x ← x − H−1∇L — and is unaffordable in high dimension: O(d²) to store H, O(d³) to solve with it. The whole field of practical second-order optimisation is one question asked repeatedly:
Every method below is a different point on the curve of fidelity versus cost. It helps to read them as a family tree with one ancestor and progressively cheaper descendants.
You cannot compute H, but you can observe its action for free: between two iterates, the gradient change is approximately the Hessian times the step.
BFGS maintains a running estimate of H−1 and after each step applies the cheapest rank-two update that makes the estimate consistent with the newly observed (step, gradient-change) pair. The estimate improves precisely along the directions the optimiser actually travels — curvature is learned where it is needed. Still O(d²) memory for the dense estimate, though, so L-BFGS keeps only the last m ≈ 10–20 pairs and applies the implied H−1 as a sequence of inner products: O(md) memory, no matrix ever formed. L-BFGS remains the workhorse for full-batch, deterministic problems; mini-batch noise breaks the secant condition (the gradient change reflects sampling, not curvature), which is the main reason it lost in deep learning.
A different economy: for losses of the form ℓ(fθ(x), y), the Hessian splits into a term built from first derivatives of the model and a term involving second derivatives. Gauss–Newton keeps only the first:
The kept piece is always positive semi-definite — Gauss–Newton cannot be attracted to saddles, fixing Newton's nastiest behaviour for free. The natural gradient arrives at nearly the same matrix from a different principle: take the steepest step not in parameter space but in distribution space, measuring distance by KL divergence between the model's output distributions. The resulting preconditioner is the Fisher information matrix, which for common losses coincides with Gauss–Newton. Same object, two readings: curvature of the model, or geometry of the distributions it defines.
K-FAC makes the Fisher tractable for networks by assuming it is block-diagonal by layer, with each block a Kronecker product of two small matrices (input covariance ⊗ gradient covariance). Inverting the small factors stands in for inverting the block. It works, and it shows up in research code; the constant-factor overhead and implementation weight have kept it out of mainstream training.
Throw away everything off-diagonal and estimate the diagonal from gradient statistics: divide each coordinate by √EMA[g²]. That is Adam. Squint and its denominator is a diagonal, square-rooted, EMA-smoothed stand-in for curvature scale — the cheapest member of this family, costing O(d) and one elementwise divide.
| Method | Approximates | From | Cost |
|---|---|---|---|
| Newton | H⁻¹ exactly | second derivatives | O(d²) mem, O(d³) solve |
| BFGS | dense H⁻¹ estimate | gradient differences | O(d²) mem |
| L-BFGS | rank-2m H⁻¹ | last m gradient pairs | O(md) |
| Gauss–Newton / natural grad | PSD part of H / Fisher | model Jacobians | solve per step (CG or factored) |
| K-FAC | block-diag Kronecker Fisher | layer activation/grad stats | moderate; heavy to implement |
| Adam | diagonal scale only | EMA of g² | O(d) |
On paper, better curvature means fewer iterations. At scale, three facts conspired against the sophisticated end of the table. Mini-batch noise corrupts curvature estimates worse than gradient estimates (and inverting a noisy matrix amplifies the noise). Hardware prefers the simple: an elementwise update is perfectly parallel and trivially sharded across devices, while Kronecker factor inversions and CG solves are not. And the iteration economics flipped — when each pass over the data is the bottleneck, a method that halves the iteration count but doubles the per-step cost wins nothing.
Meanwhile first-order methods absorbed cheap substitutes for what curvature buys: momentum smooths the zig-zag, normalisation layers and careful initialisation condition the landscape itself, and Adam's diagonal handles per-parameter scale. The landscape was repaired instead of the optimiser. Second-order ideas survive where problems are small, deterministic, and expensive per evaluation — and inside Adam, as a diagonal ghost of the Hessian.