Fit a quadratic, jump to its bottom
The gradient is a first-order object: it knows which way is downhill, not how the slope is changing. So gradient descent takes the same kind of step in a gentle valley and on a cliff face, and the user compensates by hand-tuning η. In a long narrow valley (badly conditioned curvature, the subject of data whitening) it zig-zags across the steep direction while inching along the shallow one.
The information GD is missing is curvature — the second derivative. Newton's question: if we are willing to pay for the Hessian, can we skip the tuning entirely?
Take the second-order Taylor expansion at the current point — a quadratic bowl that matches the true loss in value, slope, and curvature — and jump straight to the bottom of that bowl:
Notice what H−1 does to the geometry: it rescales each curvature direction by its own stiffness — big steps along flat directions, small steps across steep ones. Newton's method is gradient descent with a perfect, automatically chosen, matrix-valued learning rate. On an exactly quadratic loss it converges in one step from anywhere.
One Newton step: fit the dashed quadratic at xt, jump to its minimum. Near the optimum the fit is nearly exact, so the jumps land nearly on target.
Near a well-behaved minimum the error roughly squares each iteration:
Squaring the error doubles the number of correct digits per step: 10−2 → 10−4 → 10−8. Gradient descent, by contrast, multiplies the error by a constant factor each step (linear convergence) and pays more the worse the conditioning. When Newton applies — smooth loss, modest dimension, near a minimum — nothing first-order competes. Logistic regression solvers and many classical statistics fits are Newton (or its cousin, iteratively reweighted least squares) for exactly this reason.
The costs are all in the Hessian, and they scale brutally with dimension d:
Also, a stochastic Hessian estimated on a mini-batch is far noisier than a stochastic gradient — inverting noise amplifies it. Newton's precision is wasted on the noisy, non-convex, billion-dimensional setting.
| Setting | Newton? | Why |
|---|---|---|
| Logistic regression, GLMs, small convex fits | Yes | convex, d small, quadratic convergence shines |
| Classical optimisation, root finding | Yes | the default, with damping or trust regions |
| Deep networks | No | O(d²) memory, O(d³) solve, saddles, mini-batch noise |