Where error comes from, and the knob that moves it
01 · First principles
Why does a trained model err?
You train a model on a dataset D and it errs on a new point x. Why? Before naming any tradeoff, ask what could possibly go wrong. Only three things can:
Your model family cannot represent the truth, so even the best member of the family is systematically off. (Bias)
Your model family can represent the truth, but the particular D you drew pulled the fit somewhere else. Draw a different D, get a noticeably different model. (Variance)
The target itself is noisy; even the true function errs. (Irreducible noise)
That is the whole topic. Everything else is bookkeeping.
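02 · The decomposition
The bookkeeping
Fix a point x. Imagine retraining your model on many fresh datasets D, getting a predictor f̂D each time. For squared error, the expected error at x, averaged over the draw of D and the noise in the target y, splits exactly:
E[(y − f̂D(x))²] = Bias² + Variance + σ²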
Bias = how far the average model (averaged over all possible training sets) is from the truth. It survives infinite data of the same kind; it is the error of your assumptions.
Variance = how much an individual trained model wobbles around that average. It is the error of trusting one finite sample too much.
σ² = the floor. No model can push the error below it.
Note: this is an identity, not a law. It does not say bias and variance must trade off; it says total error splits into these parts. The tradeoff appears when one knob moves both.
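For readers who want to see why the split is exact, here is a short derivation sketch. The setup is an assumption the text does not spell out: the target is y = f(x) + ε with zero-mean noise of variance σ², the noise is independent of the training set D, and f̄(x) denotes the average prediction over training sets.

```latex
% Assumptions (not stated in the text): y = f(x) + \varepsilon,
% E[\varepsilon] = 0, Var(\varepsilon) = \sigma^2, \varepsilon independent of D,
% and \bar f(x) = E_D[\hat f_D(x)] is the average model at x.
\begin{align*}
\mathbb{E}_{D,\varepsilon}\big[(y - \hat f_D(x))^2\big]
  &= \mathbb{E}_{D,\varepsilon}\big[(f(x) - \hat f_D(x) + \varepsilon)^2\big] \\
  &= \mathbb{E}_D\big[(f(x) - \hat f_D(x))^2\big] + \sigma^2
     && \text{(cross term vanishes: } \mathbb{E}[\varepsilon] = 0\text{)} \\
  &= \big(f(x) - \bar f(x)\big)^2
     + \mathbb{E}_D\big[(\hat f_D(x) - \bar f(x))^2\big] + \sigma^2
     && \text{(cross term vanishes: } \mathbb{E}_D[\hat f_D(x) - \bar f(x)] = 0\text{)} \\
  &= \mathrm{Bias}^2 + \mathrm{Variance} + \sigma^2 .
\end{align*}
```

Nothing in these steps constrains how bias and variance relate to each other, which is why the result is an identity rather than a tradeoff.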
03 · The mechanism
The knob: model capacity
Simple model · fit a line to a curve
The average fit is wrong everywhere → high bias. But every dataset gives nearly the same line → low variance. It is confidently wrong, consistently.
Complex model · degree-20 polynomial
The average fit can match the truth → low bias. But each dataset's noise drags the wiggles around → high variance. Capable of being right, but unstable.
Turning capacity up lowers bias and raises variance. Total error is U-shaped in capacity; the sweet spot is where the marginal drop in bias² equals the marginal rise in variance.
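The two cases above can be checked numerically. The following is a minimal sketch, not a definitive experiment. It makes several assumptions that are not in the text: the truth is sin(πx) on [-1, 1], the training inputs are a fixed grid with only the noise redrawn per dataset, and the flexible model is degree 9 rather than degree 20, to keep the least-squares fit numerically well-conditioned. The function name `bias_variance` is hypothetical.

```python
import numpy as np

def bias_variance(degree, n_points=30, n_datasets=200, noise_sd=0.3, seed=0):
    """Monte Carlo estimate of bias^2 and variance for a polynomial fit.

    Assumptions (illustrative only): truth is sin(pi * x) on [-1, 1],
    inputs are a fixed grid, and each "fresh dataset" redraws the noise.
    """
    rng = np.random.default_rng(seed)
    truth = lambda x: np.sin(np.pi * x)
    x_train = np.linspace(-1.0, 1.0, n_points)
    x_eval = np.linspace(-1.0, 1.0, 101)

    # One row of predictions per retrained model.
    preds = np.empty((n_datasets, x_eval.size))
    for d in range(n_datasets):
        y_train = truth(x_train) + rng.normal(0.0, noise_sd, n_points)
        coeffs = np.polyfit(x_train, y_train, degree)
        preds[d] = np.polyval(coeffs, x_eval)

    mean_pred = preds.mean(axis=0)                       # the "average model"
    bias_sq = np.mean((mean_pred - truth(x_eval)) ** 2)  # averaged over x_eval
    variance = np.mean(preds.var(axis=0))                # wobble around the average
    return bias_sq, variance

line_bias_sq, line_var = bias_variance(degree=1)
flex_bias_sq, flex_var = bias_variance(degree=9)
print(f"degree 1: bias^2={line_bias_sq:.4f}  variance={line_var:.4f}")
print(f"degree 9: bias^2={flex_bias_sq:.4f}  variance={flex_var:.4f}")
```

Under these assumptions the line should show the larger bias² and the degree-9 fit the larger variance. The exact numbers depend on the noise level and sample size chosen here.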
04 · Visualize it
Dartboard and U-curve
Dartboard. Truth is the bullseye; each dart is the model trained on one dataset. You only ever throw one dart (you have one dataset). Variance is the risk that your single dart lands far out even though the aim is true.
Figure (dartboard): each dot is the same model class retrained on a fresh dataset. Bias = where the cluster sits; variance = how spread it is.
The U-curve. Training error falls monotonically with capacity. Test error falls (bias shrinking), bottoms out, then rises (variance taking over). The gap between the two curves is variance, roughly.
Figure (U-curve): train and test error against capacity. Train error always falls; test error is U-shaped; the gap between the curves is roughly variance.
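The shape of the two curves can be reproduced with a small simulation. This is a sketch under stated assumptions, none of which come from the text: the truth is sin(πx), capacity is the polynomial degree, the training inputs are a fixed grid, and errors are averaged over many noise draws so the curves are smooth. `error_curves` is a hypothetical helper name.

```python
import numpy as np

def error_curves(degrees, n_points=30, n_datasets=200, noise_sd=0.3, seed=0):
    """Average train and test MSE for each polynomial degree.

    Assumptions (illustrative only): truth is sin(pi * x) on [-1, 1],
    fixed training inputs, fresh noise for every dataset and test set.
    """
    rng = np.random.default_rng(seed)
    truth = lambda x: np.sin(np.pi * x)
    x_train = np.linspace(-1.0, 1.0, n_points)
    x_test = rng.uniform(-1.0, 1.0, 500)

    train_mse = np.zeros(len(degrees))
    test_mse = np.zeros(len(degrees))
    for _ in range(n_datasets):
        y_train = truth(x_train) + rng.normal(0.0, noise_sd, n_points)
        y_test = truth(x_test) + rng.normal(0.0, noise_sd, x_test.size)
        for i, deg in enumerate(degrees):
            coeffs = np.polyfit(x_train, y_train, deg)
            train_mse[i] += np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
            test_mse[i] += np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse / n_datasets, test_mse / n_datasets

degrees = [1, 3, 5, 7, 9, 12]
train_mse, test_mse = error_curves(degrees)
for deg, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree {deg:2d}: train={tr:.4f}  test={te:.4f}")
```

Because each polynomial family contains the lower-degree ones, the least-squares training error cannot rise as the degree grows. The test curve in this setup should bottom out at an intermediate degree.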
05 · Diagnosis
Failure first
Symptom | Problem | What helps
Train error high, test error high | Bias | Bigger model, better features, less regularisation. More data will not help.
Train error low, test error high | Variance | More data, regularisation, ensembling / bagging, early stopping, dropout.
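The "more data" entries in the two rows can be illustrated with a tenfold increase in sample size. This is a sketch only; the truth function sin(πx), the fixed input grid, the noise level, and the helper name `bias_variance_for_size` are all assumptions made for illustration.

```python
import numpy as np

def bias_variance_for_size(degree, n_points, n_datasets=200, noise_sd=0.3, seed=0):
    """Monte Carlo bias^2 and variance of a polynomial fit at a given sample size.

    Assumptions (illustrative only): truth is sin(pi * x) on [-1, 1],
    inputs are a fixed grid, and only the noise is redrawn per dataset.
    """
    rng = np.random.default_rng(seed)
    truth = lambda x: np.sin(np.pi * x)
    x_train = np.linspace(-1.0, 1.0, n_points)
    x_eval = np.linspace(-1.0, 1.0, 101)

    preds = np.empty((n_datasets, x_eval.size))
    for d in range(n_datasets):
        y_train = truth(x_train) + rng.normal(0.0, noise_sd, n_points)
        preds[d] = np.polyval(np.polyfit(x_train, y_train, degree), x_eval)

    bias_sq = np.mean((preds.mean(axis=0) - truth(x_eval)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

# High-bias model (a line): ten times more data barely moves bias^2.
line_small = bias_variance_for_size(degree=1, n_points=30)
line_large = bias_variance_for_size(degree=1, n_points=300)

# High-variance model (degree 9): ten times more data shrinks variance.
flex_small = bias_variance_for_size(degree=9, n_points=30)
flex_large = bias_variance_for_size(degree=9, n_points=300)

print("line    bias^2:", line_small[0], "->", line_large[0])
print("deg 9 variance:", flex_small[1], "->", flex_large[1])
```

In this setup the line's bias² should stay roughly where it was, while the degree-9 model's variance should fall substantially. That matches the table: data treats variance, not bias.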
Notice: almost every regularisation method is a variance-reduction device that deliberately buys a little bias.
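Ridge regression is one concrete case of this trade. The sketch below compares an unpenalised degree-9 polynomial fit with a ridge-penalised one. It assumes the truth is sin(πx), a fixed input grid, and a penalty on every coefficient including the intercept. That last choice is a simplification, not standard practice. The helper name `ridge_bias_variance` and the penalty value 1.0 are hypothetical.

```python
import numpy as np

def ridge_bias_variance(lam, degree=9, n_points=30, n_datasets=200,
                        noise_sd=0.3, seed=0):
    """Monte Carlo bias^2 and variance of a ridge-penalised polynomial fit.

    Assumptions (illustrative only): truth is sin(pi * x) on [-1, 1],
    fixed inputs, fresh noise per dataset, intercept penalised too.
    """
    rng = np.random.default_rng(seed)
    truth = lambda x: np.sin(np.pi * x)
    x_train = np.linspace(-1.0, 1.0, n_points)
    x_eval = np.linspace(-1.0, 1.0, 101)

    # Polynomial features: columns are x^0, x^1, ..., x^degree.
    X_train = np.vander(x_train, degree + 1, increasing=True)
    X_eval = np.vander(x_eval, degree + 1, increasing=True)

    # Ridge normal equations: (X^T X + lam * I) beta = X^T y.
    gram = X_train.T @ X_train + lam * np.eye(degree + 1)

    preds = np.empty((n_datasets, x_eval.size))
    for d in range(n_datasets):
        y_train = truth(x_train) + rng.normal(0.0, noise_sd, n_points)
        beta = np.linalg.solve(gram, X_train.T @ y_train)
        preds[d] = X_eval @ beta

    bias_sq = np.mean((preds.mean(axis=0) - truth(x_eval)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

plain_bias_sq, plain_var = ridge_bias_variance(lam=0.0)
ridge_bias_sq, ridge_var = ridge_bias_variance(lam=1.0)
print(f"lam=0: bias^2={plain_bias_sq:.4f}  variance={plain_var:.4f}")
print(f"lam=1: bias^2={ridge_bias_sq:.4f}  variance={ridge_var:.4f}")
```

The penalised fit should show lower variance and higher bias² than the unpenalised one. Whether the total improves depends on the penalty strength, which in practice is usually chosen by validation.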
06 · Modern caveat
Where the textbook story breaks
Deep networks past the interpolation point can see test error fall again (double descent). One proposed explanation is that heavily overparameterised models, with implicit regularisation from SGD, behave like an average over many fits, which tames variance without adding bias.
The decomposition still holds exactly; only the assumption "more capacity ⇒ more variance" breaks.
Mental Model
Expected error = (error of your assumptions)² + (error of trusting one sample) + noise.
Capacity is a knob that pours error from the first bucket into the second.
Diagnose by the train/test gap: no gap and both high → fix assumptions; big gap → fix trust (regularise, get data, average).
The decomposition is always true; the tradeoff is only usually true.