General ML

Second Order Methods

Curvature without paying for the Hessian

01 · First principlesThe problem this family shares

Newton's method sets the gold standard — precondition the gradient by inverse curvature, x ← x − H⁻¹∇L — and is unaffordable in high dimension: O(d²) to store H, O(d³) to solve with it. The whole field of practical second-order optimisation is one question asked repeatedly:

How much of H⁻¹∇L can we recover from things we can afford — gradients, model structure, or a diagonal?

Every method below is a different point on the curve of fidelity versus cost. It helps to read them as a family tree with one ancestor and progressively cheaper descendants.

02 · Quasi-NewtonBuild H⁻¹ from gradient differences

You cannot compute H, but you can observe its action for free: between two iterates, the gradient change is approximately the Hessian times the step.

∇L(x_t+1) − ∇L(x_t) ≈ H (x_t+1 − x_t) (the secant condition)

BFGS maintains a running estimate of H⁻¹ and after each step applies the cheapest rank-two update that makes the estimate consistent with the newly observed (step, gradient-change) pair. The estimate improves precisely along the directions the optimiser actually travels — curvature is learned where it is needed. Still O(d²) memory for the dense estimate, though, so L-BFGS keeps only the last m ≈ 10–20 pairs and applies the implied H⁻¹ as a sequence of inner products: O(md) memory, no matrix ever formed. L-BFGS remains the workhorse for full-batch, deterministic problems; mini-batch noise breaks the secant condition (the gradient change reflects sampling, not curvature), which is the main reason it lost in deep learning.

03 · Gauss–Newton & natural gradientCurvature of the model, not the loss

A different economy: for losses of the form ℓ(f_θ(x), y), the Hessian splits into a term built from first derivatives of the model and a term involving second derivatives. Gauss–Newton keeps only the first:

H ≈ J^T H_ℓ J (J = ∂f/∂θ; drops the second-derivative-of-f term)

The kept piece is always positive semi-definite — Gauss–Newton cannot be attracted to saddles, fixing Newton's nastiest behaviour for free. The natural gradient arrives at nearly the same matrix from a different principle: take the steepest step not in parameter space but in distribution space, measuring distance by KL divergence between the model's output distributions. The resulting preconditioner is the Fisher information matrix, which for common losses coincides with Gauss–Newton. Same object, two readings: curvature of the model, or geometry of the distributions it defines.

K-FAC makes the Fisher tractable for networks by assuming it is block-diagonal by layer, with each block a Kronecker product of two small matrices (input covariance ⊗ gradient covariance). Inverting the small factors stands in for inverting the block. It works, and it shows up in research code; the constant-factor overhead and implementation weight have kept it out of mainstream training.

04 · The degenerate survivorDiagonal preconditioning

Throw away everything off-diagonal and estimate the diagonal from gradient statistics: divide each coordinate by √EMA[g²]. That is Adam. Squint and its denominator is a diagonal, square-rooted, EMA-smoothed stand-in for curvature scale — the cheapest member of this family, costing O(d) and one elementwise divide.

Method	Approximates	From	Cost
Newton	H⁻¹ exactly	second derivatives	O(d²) mem, O(d³) solve
BFGS	dense H⁻¹ estimate	gradient differences	O(d²) mem
L-BFGS	rank-2m H⁻¹	last m gradient pairs	O(md)
Gauss–Newton / natural grad	PSD part of H / Fisher	model Jacobians	solve per step (CG or factored)
K-FAC	block-diag Kronecker Fisher	layer activation/grad stats	moderate; heavy to implement
Adam	diagonal scale only	EMA of g²	O(d)

05 · The verdictWhy first-order + tricks won at scale

On paper, better curvature means fewer iterations. At scale, three facts conspired against the sophisticated end of the table. Mini-batch noise corrupts curvature estimates worse than gradient estimates (and inverting a noisy matrix amplifies the noise). Hardware prefers the simple: an elementwise update is perfectly parallel and trivially sharded across devices, while Kronecker factor inversions and CG solves are not. And the iteration economics flipped — when each pass over the data is the bottleneck, a method that halves the iteration count but doubles the per-step cost wins nothing.

Meanwhile first-order methods absorbed cheap substitutes for what curvature buys: momentum smooths the zig-zag, normalisation layers and careful initialisation condition the landscape itself, and Adam's diagonal handles per-parameter scale. The landscape was repaired instead of the optimiser. Second-order ideas survive where problems are small, deterministic, and expensive per evaluation — and inside Adam, as a diagonal ghost of the Hessian.

Mental Model

One question organises the family: how much of H⁻¹∇L can be recovered without forming H?
Quasi-Newton learns H⁻¹ from gradient differences along the trajectory; L-BFGS keeps only the recent past.
Gauss–Newton / natural gradient use the model's own Jacobian (equivalently, KL geometry) — PSD by construction, no saddle attraction.
Adam is the degenerate survivor: a diagonal, square-rooted whisper of curvature at O(d) cost.
At scale, fixing the landscape (norms, init, momentum) beat fixing the optimiser.