General ML

Gradient Descent / SGD

Cheap noisy steps downhill, and why they win

01 · First principlesWhy follow the gradient at all?

We have a loss L(θ) over millions of parameters and we cannot solve ∇L = 0 in closed form. The only thing we can afford to compute at a point is local: the gradient. So we ask the local question — among all unit directions d, which one decreases L fastest right here?

L(θ + ηd) ≈ L(θ) + η ∇LTd,   minimised over ‖d‖=1 by   d = −∇L / ‖∇L‖

That is the entire justification: −∇L is the locally steepest descent direction. Nothing about it is global; the gradient knows the terrain only at the point where it stands. Hence the iteration:

θt+1 = θt − η ∇L(θt)

The analogy is honest, not decorative: descending a mountain in fog. You cannot see the valley; you can feel the slope under your feet, so you step downhill and re-check. The step size η decides whether you walk or fall.

02 · The naive version breaksFull-batch is exact and unaffordable

The training loss is an average over n examples, so the honest gradient is too:

∇L(θ) = (1/n) Σi=1n ∇ℓ(θ; xi, yi)

Full-batch gradient descent computes this exactly. The breakage is cost, not correctness: with n in the billions, one step costs one pass over the entire dataset. You get a perfect direction you can afford to use a handful of times. Worse, the precision is wasted — far from the optimum, a rough estimate of "downhill" is all you need; exactness only matters near the end.

03 · The fixSGD: estimate the gradient, do not compute it

Sample one example (or a mini-batch B) and use its gradient instead. It is an unbiased estimate of the true gradient: E[∇ℓi] = ∇L. Each step is now n/|B| times cheaper, so in the time full-batch takes one exact step, SGD takes thousands of noisy ones. On wall-clock — the only clock that matters — noisy and many beats exact and few.

θt+1 = θt − η · (1/|B|) Σi∈B ∇ℓit)  =  θt − η(∇L + noise)

The tradeoff the fix introduces: the iterates never settle. Near a minimum the noise keeps kicking the parameters around, so plain SGD with fixed η converges to a fuzzy ball around the optimum, not a point (decaying η shrinks the ball). But the noise turns out to be partly a feature: it bounces the iterates out of sharp, narrow minima and lets them rest only in flat ones, which tend to generalise better. SGD is an optimiser with a built-in regulariser.

The deep-learning bargain: we accepted gradient noise to buy speed, and got generalisation thrown in.

04 · The one hyperparameterLearning rate

η is THE hyperparameter. On a quadratic bowl the dynamics are exactly three regimes: too small and you crawl (each step removes a fixed fraction of the remaining distance, very slowly); just right and you converge fast; too large and each step overshoots the minimum by more than it gained, so the error grows geometrically — divergence.

η TOO SMALL crawls η RIGHT converges η TOO LARGE diverges

Same bowl, three step sizes. Overshooting by more than you descend compounds — the red path climbs out.

In practice nobody holds η fixed: warm up (large early steps on a random init are dangerous), then decay (cosine or step schedules) so the noise ball shrinks as you approach the answer.

05 · RefinementsMini-batches and momentum

Mini-batches are the engineering compromise: averaging |B| gradients cuts noise variance by 1/|B|, and GPUs compute a batch nearly as fast as a single example, so batching is almost free variance reduction. Past the point where hardware saturates, bigger batches stop paying.

Momentum in two lines: keep a running average of gradients and step along it.

vt = β vt−1 + ∇L(θt)     θt+1 = θt − η vt

Components of the gradient that persist across steps (the valley floor) accumulate; components that flip sign (the steep walls, the noise) cancel. A heavy ball rolling downhill instead of a hiker who forgets each step. This is also the first moment inside Adam; what SGD still lacks — a per-parameter step size — is that note's subject, and what it ignores about curvature is Newton's method.

Mental Model