Early Stopping

The regulariser that costs one validation set

01 · First principles: Two curves, one decision

Training loss measures fit to the sample in hand, and a sufficiently flexible model can always improve that fit — so training loss falls essentially forever, ending in memorisation. The quantity we actually care about, error on unseen data, behaves differently: it falls while the model learns structure shared between training and validation data, then turns upward when further training fits only the training set's private noise. The entire method is one observation: hold out a validation set, watch its loss, and stop where it turns.

Figure: Training loss falls monotonically; validation loss turns. The checkpoint at the turn is the model you keep.

02 · Why it works: It is regularisation in disguise

Stopping early looks like giving up; it is actually a capacity constraint. Each gradient step moves the weights a bounded distance, so stopping after t steps confines the solution to a region around the initialisation — the optimiser is never allowed to wander far enough to carve the contorted functions that memorise noise. Fewer steps, smaller reachable set, lower effective capacity.

For linear models with quadratic loss this is a theorem, not a metaphor: gradient descent fits each curvature direction at a rate set by its eigenvalue, so truncating at time t leaves the low-curvature (noise-dominated) directions unfit — the same directions an L2 penalty suppresses most:

early stopping at step t ≈ L2 penalty with λ ∝ 1/(ηt)

Train longer, regularise less. The number of epochs is a regularisation knob with the dial reversed, and early stopping is tuning that knob by direct measurement instead of by search. It belongs in the implicit row of the regularisation taxonomy — variance bought down at the price of a small bias toward the init, same as every other entry.
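The correspondence can be checked numerically on a toy problem. The sketch below assumes a quadratic loss that is already diagonal in its curvature directions, a zero initialisation, and a proportionality constant of 1 (so λ = 1/(ηt)); the curvature values, `eta`, `t`, and the function names are illustrative choices, not part of any library. Under those assumptions, gradient descent recovers a fraction 1 − (1 − ηs)^t of the unregularised solution along a direction with curvature s, while an L2 penalty recovers s/(s + λ).

```python
import numpy as np

def gd_filter(s, eta, t):
    """Fraction of the unregularised solution recovered along a direction
    with curvature s after t gradient steps from a zero initialisation."""
    return 1.0 - (1.0 - eta * s) ** t

def ridge_filter(s, lam):
    """Fraction recovered by the minimiser of the same quadratic loss
    plus an L2 penalty (lam / 2) * ||w||^2."""
    return s / (s + lam)

# Hypothetical curvatures (Hessian eigenvalues), ordered from stiff to flat.
s = np.array([1.0, 0.3, 0.1, 0.03, 0.01])
eta, t = 0.5, 20
lam = 1.0 / (eta * t)  # assumed constant of proportionality: 1

# Run plain gradient descent on 0.5 * sum(s * (w - w_star)**2).
w_star = np.ones_like(s)
w = np.zeros_like(s)
for _ in range(t):
    w -= eta * s * (w - w_star)

gd = gd_filter(s, eta, t)
ridge = ridge_filter(s, lam)
for si, wi, g, r in zip(s, w, gd, ridge):
    print(f"curvature {si:5.2f}   GD weight {wi:.3f}   "
          f"GD filter {g:.3f}   ridge filter {r:.3f}")
```

Both filters leave the stiff directions almost fully fitted and the flat directions almost untouched; they disagree most in the middle of the spectrum, which is why the relation is written with ≈ and ∝ rather than as an identity.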

03 · Mechanics: Patience and checkpoints

The implementation has exactly two moving parts, both there because validation curves are noisy:

Patience. Do not stop at the first uptick: a noisy curve upticks constantly. Stop only after the validation metric has failed to improve for k consecutive evaluations (k ≈ 5–20 when evaluating once per epoch, depending on noise). Patience trades a little wasted compute for not being fooled by a wobble.
Checkpointing. By the time patience expires you are k evaluations past the best model, so save weights whenever the metric improves and restore the best checkpoint at the end. Stopping without restoring keeps the wrong model.
Evaluate at a sensible cadence (every epoch, or every n steps for large data), and smooth the metric, for example with an exponential moving average (EMA), if the curve is jagged.
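The two moving parts fit in a few lines. What follows is a minimal, framework-agnostic sketch: `EarlyStopper`, its `min_delta` argument, the `state` dictionary, and the validation curve are all hypothetical stand-ins for a real training loop, and `copy.deepcopy` stands in for whatever checkpoint mechanism your framework provides.

```python
import copy

class EarlyStopper:
    """Tracks the best validation loss, keeps a copy of the best weights,
    and signals a stop after `patience` evaluations without improvement."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # smallest decrease that counts as improvement
        self.best_loss = float("inf")
        self.best_state = None
        self.best_step = None
        self.bad_evals = 0

    def update(self, val_loss, state, step):
        """Record one evaluation; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_state = copy.deepcopy(state)  # checkpoint on improvement
            self.best_step = step
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Made-up validation curve: an early wobble at step 2, minimum at step 5.
val_curve = [1.00, 0.80, 0.85, 0.70, 0.65, 0.60,
             0.62, 0.61, 0.63, 0.66, 0.70, 0.75]

stopper = EarlyStopper(patience=3)
state = {"w": 0.0}
for step, val_loss in enumerate(val_curve):
    state["w"] = float(step)  # placeholder for one round of training
    if stopper.update(val_loss, state, step):
        break

state = stopper.best_state  # restore the best checkpoint, not the last one
print(f"stopped at step {step}, restored step {stopper.best_step}")
```

On this curve the wobble at step 2 does not trigger a stop because the counter resets at step 3, and the final restore returns the weights from the minimum rather than from the step at which patience ran out.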

04 · The catch: The free lunch, itemised

Early stopping feels like a free lunch: no hyperparameter search over λ, no change to the loss, and the stopping point is discovered automatically as a by-product of training. The feeling is mostly justified, which is why it is among the most widely used regularisers in practice. But the bill exists:

It consumes a clean validation set. Data spent on validation is not trained on, and a validation set that leaks training information (duplicates, temporal overlap — see cross-validation) produces a stopping signal pointing at the wrong epoch.
Noisy curves mislead. Small validation sets give high-variance loss estimates; patience helps but cannot fully stabilise a 500-example signal.
Repeated decisions overfit the validation set. Stopping is a choice made by peeking at validation data; combined with heavy hyperparameter tuning on the same set, the "unseen" data quietly stops being unseen.
Double descent complicates the story. In some large-model regimes validation loss dips, rises, then falls again later (epoch-wise double descent); a run that stops at the first turn, even a patient one, misses the second descent. Rare, but the turn is no longer guaranteed to be unique.
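To put a rough number on the noisy-curve item above, the sketch below uses accuracy as a stand-in for loss because its sampling error has a simple closed form. It assumes independent validation examples and a hypothetical true accuracy of 0.90; the function name is illustrative.

```python
import math

def accuracy_standard_error(p, n):
    """Standard error of an accuracy estimate from n independent examples
    when the true accuracy is p (binomial approximation)."""
    return math.sqrt(p * (1.0 - p) / n)

for n in (500, 5_000, 50_000):
    half_width = 1.96 * accuracy_standard_error(0.90, n)
    print(f"n={n:>6}: accuracy 0.90 +/- {half_width:.3f} (approx. 95% interval)")
```

At n = 500 the half-width is about 0.026, so epoch-to-epoch differences of a point or two of accuracy are within sampling noise; the interval shrinks only with the square root of the validation set size.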

Mental Model

Training loss falls forever; validation loss turns where learning becomes memorising. Stop at the turn.
It regularises by bounding travel from the init — for linear models, provably ≈ an L2 penalty with λ ∝ 1/(ηt).
Mechanics: patience absorbs curve noise, checkpoints let you return to the actual best model.
The lunch is nearly free, but it is paid for by a clean validation set — and by every other decision you also make on it.