General ML

Overfitting / Underfitting

Two failures, one gap, one instrument

01 · First principles · The two ways to fail at generalising

A model can fail on unseen data from opposite directions. It can memorise: reproduce the training set, private noise and all, having learned the sample rather than the structure (overfitting). Or it can fail to learn in the first place: lack the capacity or the training to capture structure that is plainly in the data (underfitting). These are the operational faces of the bias–variance tradeoff: underfitting is what high bias looks like in a training log, overfitting is what high variance looks like. The bias–variance note explains where the error comes from; this one is about detecting which failure you have, because the remedies point in opposite directions and applying the wrong one makes things worse.

02 · Diagnosis · Everything is in the gap

The diagnosis requires exactly two numbers: training error and validation error. Their absolute level and the gap between them place you in one of four quadrants.

| Train error | Validation error | Diagnosis    | Read it as |
|-------------|------------------|--------------|------------|
| high        | high (small gap) | underfitting | cannot even fit the data it has seen; a capacity or optimisation problem |
| low         | high (large gap) | overfitting  | fits seen data, fails unseen; memorising sample noise |
| high        | even higher      | both         | too little capacity and unstable fit; common with bad features or bugs |
| low         | low (small gap)  | healthy      | learning structure that transfers; stop fiddling |

The discipline: never diagnose from validation error alone. A validation error of 20% means nothing by itself: with training error at 19% it is a bias problem, at 2% a variance problem, and the fixes are opposites.
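
The four-quadrant read can be sketched as a small function. This is a minimal illustration, not a calibrated rule: the name `diagnose` and the thresholds `high` and `big_gap` are hypothetical, and what counts as "high" depends on the task and the target error.

```python
def diagnose(train_err, val_err, high=0.10, big_gap=0.05):
    """Place a (train, validation) error pair in one of the four quadrants.

    `high` and `big_gap` are hypothetical thresholds chosen for illustration.
    """
    gap = val_err - train_err
    train_high = train_err > high
    gap_large = gap > big_gap
    if train_high and gap_large:
        return "both"
    if train_high:
        return "underfitting"
    if gap_large:
        return "overfitting"
    return "healthy"


# The same 20% validation error lands in opposite quadrants.
print(diagnose(0.19, 0.20))  # underfitting
print(diagnose(0.02, 0.20))  # overfitting
```

The point of the sketch is the signature: the function refuses to exist without both numbers.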

03 · The instrument · Learning curves

A single (train, val) pair is a snapshot; learning curves — error versus training set size — show the trend, and the trend tells you what more data would buy. The two failures produce unmistakably different pictures:

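[Figure: two learning-curve panels, error versus training set size, each with a target error line. Left panel, UNDERFITTING: train and validation curves converged, both high. Right panel, OVERFITTING: train error near zero, validation error still falling; the gap is the variance.]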

Left: curves converge at a high plateau — more data cannot help, the model is the ceiling. Right: a persistent gap with the validation curve still descending — more data is exactly what would help.
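
To make the two pictures concrete, here is a minimal sketch that computes learning curves on growing subsamples using only the standard library. Everything in it is an assumption made for illustration: the synthetic task (`y = 2x` plus noise), the helper names, and the two deliberately extreme models, a mean predictor standing in for too little capacity and a 1-nearest-neighbour predictor standing in for pure memorisation.

```python
import random


def mse(predict, data):
    """Mean squared error of a predictor over (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


def fit_mean(train):
    """Too little capacity: ignores x and predicts the training mean."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_1nn(train):
    """Pure memorisation: returns the label of the nearest training point."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]


def learning_curve(fit, train, val, sizes):
    """Return (size, train_error, val_error) for growing prefixes of `train`."""
    rows = []
    for n in sizes:
        subset = train[:n]
        model = fit(subset)
        rows.append((n, mse(model, subset), mse(model, val)))
    return rows


def make_data(n, rng, noise=0.1):
    """Synthetic task (an assumption): y = 2x plus Gaussian label noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        data.append((x, 2 * x + rng.gauss(0, noise)))
    return data


rng = random.Random(0)
train, val = make_data(400, rng), make_data(200, rng)
sizes = [10, 25, 50, 100, 200, 400]

for name, fit in [("mean (underfit)", fit_mean), ("1-NN (memorising)", fit_1nn)]:
    print(name)
    for n, tr, va in learning_curve(fit, train, val, sizes):
        print(f"  n={n:4d}  train={tr:.3f}  val={va:.3f}")
```

On this toy task the mean predictor should show train and validation error sitting at a similar high level, while 1-NN has zero training error by construction and a validation error that stays above it. Real models sit between these extremes, but the shape of the table is what the plots summarise.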

This is the cheapest expensive-question-answerer in ML: before paying for data collection, plot the curves on subsamples. If train and validation have already converged (left panel), the money is wasted; the model family is the bottleneck. If a large gap persists and validation error is still falling with size (right panel), data is precisely the purchase to make.
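
The buy-or-don't-buy reading can be written down as a rough heuristic over the last two points of a curve. This is a sketch under stated assumptions: the function name, the tuple layout `(train_size, train_error, val_error)`, the thresholds `gap_tol` and `slope_tol`, and the example numbers are all hypothetical.

```python
def more_data_would_help(curve, gap_tol=0.02, slope_tol=0.005):
    """Heuristic read of a learning curve.

    `curve` is a list of (train_size, train_error, val_error) tuples ordered
    by size. Returns True when a gap persists at the largest size and the
    validation error is still falling. Both thresholds are illustrative.
    """
    (_, _, prev_val), (_, train_err, val_err) = curve[-2], curve[-1]
    gap = val_err - train_err
    still_falling = (prev_val - val_err) > slope_tol
    return gap > gap_tol and still_falling


# Made-up curves for illustration only.
converged = [(100, 0.18, 0.24), (200, 0.20, 0.22), (400, 0.205, 0.215)]
gapped = [(100, 0.00, 0.30), (200, 0.01, 0.24), (400, 0.01, 0.19)]

print(more_data_would_help(converged))  # False: the model family is the ceiling
print(more_data_would_help(gapped))     # True: gap persists, validation still falling
```

Two points are a noisy basis for a slope; in practice one would average over several resampled curves before trusting the answer.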

04 · Remedies · Mapped to the failure

Underfitting · raise capacity or fit harder
Bigger or deeper model; better features; train longer with a tuned learning rate; reduce regularisation; check for optimisation bugs (bad init, saturating activations). More data will not help — the model cannot even use what it has.
Overfitting · constrain or dilute
More data or augmentation (dilutes the noise being memorised); regularisation — weight decay, dropout, label smoothing; early stopping; a smaller model; ensembling. Each buys bias to cut variance.
The classic mistake is treating every bad validation number as overfitting and reflexively adding dropout. Regularising an underfit model pushes both errors up. Read the gap first; the gap chooses the column.
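
The claim that regularising an underfit model pushes both errors up can be checked on a toy case. The sketch below is an illustration under assumptions, not a general proof: the synthetic task (a parabola), the one-parameter model `y ≈ w * x`, and the helper names are all hypothetical. The model is underfit by construction, since a line through the origin cannot represent the curve, and the ridge penalty `lam` is then increased.

```python
import random


def make_data(n, rng, noise=0.05):
    """Synthetic task (an assumption): y = 4(x - 0.5)^2 plus Gaussian noise."""
    return [(x, 4 * (x - 0.5) ** 2 + rng.gauss(0, noise))
            for x in (rng.random() for _ in range(n))]


def fit_ridge_1d(data, lam):
    """Closed-form ridge fit of y ≈ w * x: w = sum(xy) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)


def mse(w, data):
    """Mean squared error of the one-parameter model y ≈ w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)
train, val = make_data(200, rng), make_data(200, rng)

results = {}
for lam in [0.0, 10.0, 50.0, 200.0]:
    w = fit_ridge_1d(train, lam)
    results[lam] = (mse(w, train), mse(w, val))
    print(f"lam={lam:6.1f}  w={w:.3f}  train={results[lam][0]:.3f}  val={results[lam][1]:.3f}")
```

Training error cannot go down as `lam` grows, because `lam = 0` is the least-squares minimiser for this model. The rise in validation error is what this toy setup is expected to show; it is not guaranteed for every dataset.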

05 · Caveats · Where the clean picture blurs

Two modern footnotes. First, training error near zero is no longer automatically alarming: large overparameterised networks routinely interpolate their training data and still generalise (double descent; the implicit regularisation of SGD is doing unbilled work). The gap remains meaningful, but "train error ≈ 0" alone is not a diagnosis. Second, the entire framework rests on the validation set being honest: drawn from the deployment distribution and untouched by training. Leakage produces a small gap and a confident "healthy" verdict for a model that will fail in production; the hygiene rules live in the cross-validation note.
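
One narrow form of leakage is easy to test for: validation rows that also appear verbatim in the training set. The sketch below is a minimal check under that assumption only; the function name and the example rows are hypothetical, and it says nothing about near-duplicates, preprocessing fitted on the full data, or a validation set drawn from the wrong distribution.

```python
def overlap_fraction(train_rows, val_rows):
    """Fraction of validation rows that also appear verbatim in training.

    Rows are compared as tuples, so this only catches exact duplicates.
    """
    seen = {tuple(row) for row in train_rows}
    hits = sum(1 for row in val_rows if tuple(row) in seen)
    return hits / len(val_rows)


# Made-up rows for illustration: (feature_1, feature_2, label).
train_rows = [(1.0, 2.0, 0), (3.0, 4.0, 1), (5.0, 6.0, 0)]
val_rows = [(3.0, 4.0, 1), (7.0, 8.0, 1)]

print(overlap_fraction(train_rows, val_rows))  # 0.5
```

A non-zero result means the gap is being measured partly on memorised examples, so the "healthy" quadrant cannot be trusted until the overlap is removed.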

Mental Model