
Early Stopping

The regulariser that costs one validation set

01 · First principles: Two curves, one decision

Training loss measures fit to the sample in hand, and a sufficiently flexible model can always improve that fit — so training loss falls essentially forever, ending in memorisation. The quantity we actually care about, error on unseen data, behaves differently: it falls while the model learns structure shared between training and validation data, then turns upward when further training fits only the training set's private noise. The entire method is one observation: hold out a validation set, watch its loss, and stop where it turns.

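[Figure: training loss and validation loss plotted against training epochs. Training loss keeps falling; validation loss falls while the model is still learning structure, then rises once it is memorising noise. The turn is marked "stop here".]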

Training loss falls monotonically; validation loss turns. The checkpoint at the turn is the model you keep.
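The two curves can be reproduced on a toy problem. The sketch below is an illustration under stated assumptions, not a recipe: the data are synthetic, the model is an over-parameterised linear regression fitted by plain gradient descent, and every name and constant (`make_data`, `mse`, the sizes, the step count) is hypothetical. It records both losses at every step and picks the step where validation loss is lowest.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 20 noisy training points, 50 features,
# only the first 3 features carry signal.
n_train, n_val, n_features = 20, 200, 50
true_w = np.zeros(n_features)
true_w[:3] = [2.0, -1.0, 0.5]


def make_data(n):
    """Draw n samples from the synthetic linear model with unit-variance noise."""
    X = rng.normal(size=(n, n_features))
    y = X @ true_w + rng.normal(scale=1.0, size=n)
    return X, y


def mse(X, y, w):
    """Mean squared error of the linear model w on (X, y)."""
    return float(np.mean((X @ w - y) ** 2))


X_train, y_train = make_data(n_train)
X_val, y_val = make_data(n_val)

# A step size below 1/L, where L is the largest Hessian eigenvalue of the
# training MSE, keeps gradient descent monotone on the training loss.
hessian = 2.0 * X_train.T @ X_train / n_train
eta = 0.5 / np.linalg.eigvalsh(hessian).max()

w = np.zeros(n_features)
train_curve, val_curve = [], []
for step in range(500):
    train_curve.append(mse(X_train, y_train, w))
    val_curve.append(mse(X_val, y_val, w))
    grad = 2.0 * X_train.T @ (X_train @ w - y_train) / n_train
    w -= eta * grad

# The step to keep is the one with the lowest validation loss.
best_step = int(np.argmin(val_curve))
print(f"best validation step: {best_step}, val loss there: {val_curve[best_step]:.3f}")
print(f"final train loss: {train_curve[-1]:.3e}, final val loss: {val_curve[-1]:.3f}")
```

With more features than training points the training loss can be driven towards zero, so in a setup like this the validation minimum typically arrives long before training loss stops improving. The exact step depends on the seed and the noise level.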

02 · Why it works: It is regularisation in disguise

Stopping early looks like giving up; it is actually a capacity constraint. Each gradient step moves the weights a bounded distance, so stopping after t steps confines the solution to a region around the initialisation — the optimiser is never allowed to wander far enough to carve the contorted functions that memorise noise. Fewer steps, smaller reachable set, lower effective capacity.
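The "bounded distance" argument can be written out in one line. This is a sketch for plain gradient descent with a fixed learning rate η, and it assumes (an assumption not stated in the text) that every gradient encountered has norm at most some constant G:

```latex
% Plain gradient descent: w_{s+1} = w_s - \eta \nabla L(w_s).
% Assumption: \|\nabla L(w_s)\| \le G for every step s.
\[
\| w_t - w_0 \|
  = \Bigl\| \sum_{s=0}^{t-1} \eta \, \nabla L(w_s) \Bigr\|
  \le \eta \sum_{s=0}^{t-1} \| \nabla L(w_s) \|
  \le \eta \, t \, G .
\]
```

After t steps the weights lie inside a ball of radius ηtG around the initialisation, so a smaller t means a smaller reachable set.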

For linear models with quadratic loss this is a derivable result, not a metaphor: gradient descent fits each curvature direction at a rate set by its eigenvalue, so truncating at step t leaves the low-curvature (noise-dominated) directions largely unfit. Those are the same directions an L2 penalty suppresses most, which gives the approximate correspondence:

early stopping at step t  ≈  L2 penalty with λ ∝ 1/(ηt)
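A short derivation shows where the 1/(ηt) comes from. It is a sketch under assumptions that the text does not spell out: a quadratic loss with Hessian eigenvalues h_i and minimiser w*, gradient descent started from w = 0, and an L2 penalty written as (λ/2)‖w‖². All symbols other than η, t and λ are introduced here for illustration.

```latex
% Work in the eigenbasis of the Hessian H; h_i is the curvature along direction i.
\begin{align*}
\text{GD after } t \text{ steps:} \quad
  w_i(t) &= \bigl[\, 1 - (1 - \eta h_i)^t \,\bigr] \, w_i^{*} \\
\text{L2-penalised optimum:} \quad
  w_i^{\lambda} &= \frac{h_i}{h_i + \lambda} \, w_i^{*} \\
\text{Low curvature } (\eta h_i t \ll 1,\; h_i \ll \lambda): \quad
  1 - (1 - \eta h_i)^t &\approx \eta h_i t,
  \qquad \frac{h_i}{h_i + \lambda} \approx \frac{h_i}{\lambda} \\
\text{Matching the two:} \quad
  \lambda &\approx \frac{1}{\eta t}
\end{align*}
```

Along high-curvature directions both factors are close to 1, so both methods leave those directions almost fully fitted. The correspondence is approximate, which is why the relation is written with ≈ and ∝ rather than an equals sign.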

Train longer, regularise less. The number of epochs is a regularisation knob with the dial reversed, and early stopping is tuning that knob by direct measurement instead of by search. It belongs in the implicit row of the regularisation taxonomy — variance bought down at the price of a small bias toward the init, same as every other entry.
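The reversed dial can be checked numerically. The sketch below assumes a quadratic loss viewed in the eigenbasis of its Hessian, gradient descent started from zero, and an L2 penalty of the form (λ/2)‖w‖². The eigenvalues, learning rate, and function names are hypothetical choices for illustration. It compares the fraction of the optimum that each method recovers per direction when λ is set to 1/(ηt).

```python
import numpy as np

eta = 0.01                          # learning rate (hypothetical value)
h = np.geomspace(1e-3, 10.0, 50)    # hypothetical Hessian eigenvalues, low to high curvature


def gd_fit_fraction(h, eta, t):
    """Fraction of the optimum reached per eigen-direction after t GD steps from w = 0."""
    return 1.0 - (1.0 - eta * h) ** t


def ridge_fit_fraction(h, lam):
    """Fraction of the optimum kept per eigen-direction under the penalty (lam / 2) * ||w||^2."""
    return h / (h + lam)


for t in (10, 100, 1000):
    lam = 1.0 / (eta * t)           # matched penalty: more steps means a weaker penalty
    gap = np.max(np.abs(gd_fit_fraction(h, eta, t) - ridge_fit_fraction(h, lam)))
    print(f"t={t:5d}  lambda={lam:7.3f}  max gap between fit fractions={gap:.3f}")
```

Both fractions rise from near 0 on low-curvature directions to near 1 on high-curvature ones. They agree closely at the low-curvature end and differ moderately in between, which is consistent with the relation being approximate.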

03 · Mechanics: Patience and checkpoints

The implementation has two essential moving parts, plus one practical setting, all there because validation curves are noisy:

  1. Patience. Do not stop at the first uptick: a noisy curve upticks constantly. Stop only after the validation metric has failed to improve for k consecutive evaluations (k ≈ 5–20 when evaluating once per epoch, larger for noisier curves). Patience trades a little wasted compute for not being fooled by a wobble.
  2. Checkpointing. By the time patience expires you are k evaluations past the best model, so save the weights whenever the metric improves and restore the best checkpoint at the end. Stopping without restoring keeps the wrong model.
  3. Evaluation cadence. Evaluate at a sensible interval (every epoch, or every n steps for large datasets), and smooth the metric with an exponential moving average (EMA) if the curve is jagged.
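The pieces above can be sketched as a small framework-agnostic helper. This is a minimal illustration, not any library's API: the class `EarlyStopper`, its parameters (`patience`, `min_delta`, `ema_alpha`), and the scripted validation losses are all hypothetical. The "weights" are a plain dict standing in for a real model state.

```python
import copy


class EarlyStopper:
    """Tracks a validation loss, keeps the best weights, and signals when to stop."""

    def __init__(self, patience=5, min_delta=0.0, ema_alpha=None):
        self.patience = patience      # evaluations to wait without improvement
        self.min_delta = min_delta    # smallest decrease that counts as an improvement
        self.ema_alpha = ema_alpha    # None disables smoothing; otherwise weight on the new value
        self.best_loss = float("inf")
        self.best_state = None        # deep copy of the weights at the best evaluation
        self.best_eval = -1           # index of the best evaluation so far
        self.bad_evals = 0            # consecutive evaluations without improvement
        self._smoothed = None
        self._n_evals = 0

    def update(self, val_loss, state):
        """Record one evaluation; return True when training should stop."""
        if self.ema_alpha is None or self._smoothed is None:
            self._smoothed = val_loss
        else:
            a = self.ema_alpha
            self._smoothed = a * val_loss + (1 - a) * self._smoothed

        if self._smoothed < self.best_loss - self.min_delta:
            # Improvement: checkpoint the weights and reset the patience counter.
            self.best_loss = self._smoothed
            self.best_state = copy.deepcopy(state)
            self.best_eval = self._n_evals
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        self._n_evals += 1
        return self.bad_evals >= self.patience


# Toy run with scripted validation losses; note the harmless uptick at epoch 3.
val_losses = [1.00, 0.80, 0.70, 0.72, 0.69, 0.71, 0.73, 0.74, 0.75, 0.90]
stopper = EarlyStopper(patience=3)
weights = {"step": 0}
stopped_at = None
for epoch, loss in enumerate(val_losses):
    weights["step"] = epoch           # stand-in for one real training epoch
    if stopper.update(loss, weights):
        stopped_at = epoch
        break

weights = stopper.best_state          # restore the best checkpoint, not the last one
print(stopped_at, stopper.best_eval, weights)
```

In this toy run the uptick at epoch 3 does not trigger a stop because the loss improves again at epoch 4. Training halts three evaluations later, and the restored weights are those from epoch 4. The deep copy matters: without it, the stored checkpoint would keep changing as training continues to mutate the live weights.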

04 · The catch: The free lunch, itemised

Early stopping feels like a free lunch: no hyperparameter search over λ, no change to the loss, and a good stopping point discovered automatically as a by-product of training. The feeling is mostly justified, which is why it is among the most widely used regularisers in practice. But the bill exists:

Mental Model