General ML

Overfitting / Underfitting

Two failures, one gap, one instrument

01 · First principles · The two ways to fail at generalising

A model can fail on unseen data from opposite directions. It can memorise: reproduce the training set, private noise and all, having learned the sample rather than the structure (overfitting). Or it can fail to learn at all: lack the capacity or the training to capture structure that is plainly in the data (underfitting). These are the operational faces of the bias–variance tradeoff: underfitting is what high bias looks like in a training log, overfitting is what high variance looks like. The bias–variance note explains where the error comes from; this one is about detecting which failure you have, because the remedies point in opposite directions and applying the wrong one makes things worse.

02 · Diagnosis · Everything is in the gap

The diagnosis requires exactly two numbers: training error and validation error. Their absolute level and the gap between them place you in one of four quadrants.

Train error	Validation error	Diagnosis	Read it as
high	high (small gap)	underfitting	cannot even fit the data it has seen — capacity or optimisation problem
low	high (large gap)	overfitting	fits seen data, fails unseen — memorising sample noise
high	even higher	both	too little capacity and unstable fit; common with bad features or bugs
low	low (small gap)	healthy	learning structure that transfers — stop fiddling

The discipline: never diagnose from validation error alone. A validation error of 20% means nothing by itself — with training error at 19% it is a bias problem, at 2% a variance problem, and the fixes are opposites.
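The four-quadrant table can be sketched as a small function. This is a minimal illustration, not a rule from the text: the name `diagnose` and the thresholds `high` and `big_gap` are assumptions, and what counts as "high" depends on the task and on the error that is actually achievable.

```python
def diagnose(train_err, val_err, high=0.10, big_gap=0.05):
    """Place a (train error, validation error) pair in one of four quadrants.

    `high` and `big_gap` are illustrative thresholds (assumptions); tune them
    to the task before trusting the verdict.
    """
    gap = val_err - train_err
    train_high = train_err >= high
    gap_large = gap >= big_gap
    if train_high and gap_large:
        return "both"          # high train error, validation even higher
    if train_high:
        return "underfitting"  # cannot fit the data it has seen
    if gap_large:
        return "overfitting"   # fits seen data, fails unseen
    return "healthy"           # low error, small gap


# The same 20% validation error gets opposite verdicts:
print(diagnose(0.19, 0.20))  # underfitting
print(diagnose(0.02, 0.20))  # overfitting
```

The function never looks at validation error on its own; it only uses the training error and the gap, which is the discipline stated above.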

03 · The instrument · Learning curves

A single (train, val) pair is a snapshot; learning curves — error versus training set size — show the trend, and the trend tells you what more data would buy. The two failures produce unmistakably different pictures:

Figure (two panels). Left: curves converge at a high plateau; more data cannot help, the model is the ceiling. Right: a persistent gap with the validation curve still descending; more data is exactly what would help.

This is the cheapest expensive-question-answerer in ML: before paying for data collection, plot the curves on subsamples. If train and validation have already converged (left panel), the money is wasted; the model family is the bottleneck. If a large gap persists and validation error is still falling with size (right panel), data is precisely the purchase to make.
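Reading the curves can be sketched in code once the errors at each subsample size have been measured. The helper below is hypothetical: the name `more_data_would_help` and the tolerances `gap_tol` and `slope_tol` are assumptions for illustration, and the numbers in the usage lines are made up to mimic the two panels, not measured results.

```python
def more_data_would_help(sizes, train_err, val_err, gap_tol=0.02, slope_tol=0.005):
    """Read a learning curve: is a gap still open with validation error falling?

    sizes, train_err and val_err are parallel lists ordered by increasing
    training set size. gap_tol and slope_tol are illustrative thresholds.
    """
    if not (len(sizes) == len(train_err) == len(val_err)) or len(sizes) < 2:
        raise ValueError("need at least two aligned curve points")
    final_gap = val_err[-1] - train_err[-1]
    recent_drop = val_err[-2] - val_err[-1]  # improvement over the last step
    if final_gap <= gap_tol:
        return False  # curves have converged: the model family is the ceiling
    # Gap still open: more data helps only if validation error is still descending.
    return recent_drop >= slope_tol


sizes = [500, 1000, 2000, 4000]
# Converged at a high plateau (left panel): returns False
print(more_data_would_help(sizes, [0.18, 0.19, 0.195, 0.20], [0.26, 0.23, 0.215, 0.21]))
# Persistent gap, validation still falling (right panel): returns True
print(more_data_would_help(sizes, [0.00, 0.01, 0.02, 0.03], [0.30, 0.26, 0.22, 0.19]))
```

A gap that is open but flat falls outside the two cases described here; this sketch conservatively answers `False` for it.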

04 · Remedies · Mapped to the failure

Underfitting · raise capacity or fit harder

Bigger or deeper model; better features; train longer with a tuned learning rate; reduce regularisation; check for optimisation bugs (bad init, saturating activations). More data will not help — the model cannot even use what it has.

Overfitting · constrain or dilute

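More data or augmentation (dilutes the noise being memorised); regularisation (weight decay, dropout, label smoothing); early stopping; a smaller model; ensembling. Regularisation, early stopping and a smaller model buy a little bias to cut variance; more data and ensembling cut variance without adding much bias.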
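Of the overfitting remedies listed above, early stopping is the easiest to show in a few lines. The sketch below assumes a patience-based rule, which is one common variant rather than the only one; the function name and the `patience` and `min_delta` parameters are illustrative, and the loss values are invented.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for per-epoch validation losses.

    Training stops once `patience` consecutive epochs fail to beat the best
    loss by more than `min_delta`. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch  # stop here, restore weights from best_epoch
    return best_epoch, len(val_losses) - 1  # never triggered: ran to the end


# Validation loss bottoms out at epoch 3, then climbs as the model starts memorising.
losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.61, 0.65]
print(early_stopping_epoch(losses))  # (3, 6)
```

In a real training loop the same counter would wrap the epoch loop and the weights from `best_epoch` would be checkpointed and restored.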

The classic mistake is treating every bad validation number as overfitting and reflexively adding dropout. Regularising an underfit model pushes both errors up. Read the gap first; the gap chooses the column.

05 · Caveats · Where the clean picture blurs

Two modern footnotes. First, training error near zero is no longer automatically alarming: large overparameterised networks routinely interpolate their training data and still generalise (double descent; the implicit regularisation of SGD is doing unbilled work). The gap remains meaningful, but "train error ≈ 0" alone is not a diagnosis. Second, the entire framework rests on the validation set being honest — drawn from the deployment distribution and untouched by training. Leakage produces a small gap and a confident "healthy" verdict for a model that will fail in production; the hygiene rules live in cross-validation.
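One cheap check on the second caveat is to look for validation examples that also appear in the training set. The sketch below is a partial safeguard under stated assumptions: examples are plain tuples or lists of values, the helper names are hypothetical, and only exact duplicates are caught. Near-duplicates, temporal leakage and preprocessing fitted on the full dataset all pass undetected.

```python
import hashlib


def row_fingerprint(row):
    """Stable fingerprint of one example (a tuple or list of plain values)."""
    return hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest()


def overlapping_rows(train_rows, val_rows):
    """Return the validation rows that also occur verbatim in the training set."""
    train_prints = {row_fingerprint(r) for r in train_rows}
    return [r for r in val_rows if row_fingerprint(r) in train_prints]


train = [(1.0, 2.0, "a"), (3.0, 4.0, "b")]
val = [(3.0, 4.0, "b"), (5.0, 6.0, "c")]
print(overlapping_rows(train, val))  # [(3.0, 4.0, 'b')]
```

An empty result does not prove the validation set is honest; a non-empty one is enough to distrust a small gap.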

Mental Model

Overfitting = memorising the sample; underfitting = failing to learn the structure. High variance and high bias, respectively, as seen in a training log.
Diagnose only from the pair (train error, gap) — never from validation error alone.
Learning curves answer the expensive question: converged-and-high means buy capacity, gapped-and-falling means buy data.
Remedies are opposites; the reflexive "add regularisation" actively harms an underfit model.
The whole instrument rests on a clean validation set, and interpolating big nets have blurred (not erased) the old rules.