Linear Algebra

Hessian

The matrix that knows the shape of the bowl

01 · First principlesThe gradient is blind to shape

The gradient of a loss tells you which way is down, and nothing more. It cannot distinguish a narrow ravine from a wide bowl, a minimum from a saddle ahead, or how far the descent remains trustworthy. All of that is curvature — second-derivative information — and it lives in the Hessian, the symmetric matrix Hij = ∂²L/∂θᵢ∂θⱼ. Locally:

L(θ + δ) ≈ L(θ) + ∇Lᵀδ + ½ δᵀHδ
the quadratic term: how the slope itself changes as you move

Equivalently, H is the Jacobian of the gradient. Because H is symmetric, the spectral theorem applies: orthonormal eigendirections with real eigenvalues, and each λᵢ is the curvature of the loss along its eigendirection — the local landscape, fully described by one spectrum.

02 · Reading the spectrumCritical points, and the tyranny of saddles

At a critical point (∇L = 0), the eigenvalue signs are the whole verdict:

Spectrum of HGeometryVerdict
all λ > 0bowl in every direction (positive definite)local minimum
all λ < 0dome in every directionlocal maximum
mixed signsbowl some ways, dome otherssaddle point
some λ = 0flat valley directionsdegenerate — the test is silent there

Now the high-dimensional punchline. A minimum requires every one of n eigenvalues to be positive; if signs were even roughly coin-flips, that costs 2⁻ⁿ. In million-dimensional loss surfaces, almost every critical point is a saddle, and the classical fear of "getting stuck in a bad local minimum" largely gives way to "crawling slowly past saddles" — where the escape direction exists (the negative-λ eigenvector) but the gradient there is tiny. The λ = 0 row matters too: overparameterised networks have huge flat subspaces at their minima (the null space picture), which is why "the minimum" is really a connected valley.

03 · How it breaks GDCondition number and the zig-zag

Away from critical points, the Hessian decides how fast gradient descent can go. On a quadratic bowl, the error along each eigendirection contracts by |1 − ηλᵢ| per step. Stability demands η < 2/λ_max; progress along the flattest direction then crawls at 1 − ηλ_min. One step size must serve both, and the tension is summarised by a single ratio, the condition number κ = λ_max / λ_min — iterations to converge scale like κ.

GD: BOUNCES BETWEEN STEEP WALLS PRECONDITIONED: STRAIGHT DOWN THE VALLEY λ_min AXIS λ_max AXIS

Elongated contours (large κ). The gradient points across the valley, not along it, so GD ricochets; rescaling by curvature would walk straight to the minimum.

The picture explains the ricochet: the gradient is perpendicular to the contour, and on an elongated bowl that direction is mostly across the valley. Each step overshoots in the steep direction and barely advances in the flat one. Round bowl (κ ≈ 1): one step lands near the centre. This is the same κ that governs error amplification in linear solves — conditioning is one concept, visiting optimisation and numerics alike.

04 · The fix and its priceNewton's method rescales by H⁻¹

If the bowl's elongation is the problem, undo it. Minimising the local quadratic model exactly gives the Newton step:

δ = −H⁻¹∇L
divide each eigendirection's gradient by its own curvature

In the eigenbasis, this divides the component along vᵢ by λᵢ: steep directions are damped, flat directions amplified, the ellipse is rescaled into a circle, and κ effectively becomes 1 — quadratics are solved in one step. The fine print: H must be positive definite for the step to even point downhill (at a saddle, dividing by a negative λ attracts you to it), and an exact Hessian is unaffordable — n² entries to store and an O(n³) solve, with n in the millions.

05 · Why ML caresWhat survives at scale

Modern optimisers are a museum of partial Hessians — each keeps the curvature information it can afford:

  1. Diagonal preconditioners (Adam, RMSProp). Keep only a per-parameter curvature proxy (running second moments of gradients) and divide by its square root: a diagonal, always-PSD imitation of H⁻¹. Crude — it ignores all cross-parameter curvature — but it costs O(n) and tames the worst per-coordinate scale disparities.
  2. L-BFGS. Store the last k (gradient difference, step) pairs and implicitly build a rank-k picture of curvature: a low-memory Newton that never forms H.
  3. Hessian-vector products. The full H is unaffordable, but Hv costs about one extra backprop (differentiate ∇Lᵀv) — enough for power iteration on λ_max, saddle-escape directions, and conjugate-gradient Newton steps.
  4. Gauss–Newton / natural gradient. Replace H with the born-PSD JᵀJ or Fisher matrix, trading exactness for guaranteed-downhill geometry.
The named connection: Adam is the field's working compromise on this entire note — a diagonal estimate of curvature, because κ = λ_max/λ_min punishes plain GD and the true H⁻¹ is forever out of budget.
Mental Model