General ML

Bias–Variance Tradeoff

Where error comes from, and the knob that moves it

01 · First principles
Why does a trained model err?

You train a model on a dataset D and it errs on a new point x. Why? Before naming any tradeoff, ask what could possibly go wrong. Only three things can:

  1. Your model family cannot represent the truth, so even the best member of the family is systematically off. (Bias)
  2. Your model family can represent the truth, but the particular D you drew pulled the fit somewhere else. Draw a different D, get a noticeably different model. (Variance)
  3. The target itself is noisy; even the true function errs. (Irreducible noise)
That is the whole topic. Everything else is bookkeeping.

02 · The decomposition
The bookkeeping

Fix a point x. Imagine retraining your model on many fresh datasets D, getting a predictor f̂D each time. For squared error, the expected error splits exactly:

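$$
\mathbb{E}_{D,\varepsilon}\big[(y - \hat f_D(x))^2\big]
= \underbrace{\big(f(x) - \mathbb{E}_D[\hat f_D(x)]\big)^2}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}_D\big[(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)])^2\big]}_{\text{Variance}}
+ \underbrace{\sigma^2}_{\text{Noise}}
$$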
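Why the split is exact can be sketched in a few lines. This is the standard textbook argument, not something specific to this page, and it rests on assumptions the text leaves implicit: the target is y = f(x) + ε with zero-mean noise of variance σ², and the noise at the new point is independent of the training set D. The shorthand f̄(x) for the average prediction is introduced here for convenience.

```latex
% Assumptions: y = f(x) + \varepsilon, E[\varepsilon] = 0, Var(\varepsilon) = \sigma^2,
% and \varepsilon is independent of the training set D.
% Shorthand (introduced for this sketch): \bar f(x) = \mathbb{E}_D[\hat f_D(x)].
\begin{align*}
y - \hat f_D(x)
  &= \varepsilon + \big(f(x) - \bar f(x)\big) + \big(\bar f(x) - \hat f_D(x)\big) \\[4pt]
\mathbb{E}_{D,\varepsilon}\big[(y - \hat f_D(x))^2\big]
  &= \mathbb{E}[\varepsilon^2]
   + \big(f(x) - \bar f(x)\big)^2
   + \mathbb{E}_D\big[(\bar f(x) - \hat f_D(x))^2\big] \\
  &\quad + 2\,\big(f(x) - \bar f(x)\big)\,\mathbb{E}[\varepsilon]
   + 2\,\mathbb{E}[\varepsilon]\,\mathbb{E}_D\big[\bar f(x) - \hat f_D(x)\big] \\
  &\quad + 2\,\big(f(x) - \bar f(x)\big)\,\mathbb{E}_D\big[\bar f(x) - \hat f_D(x)\big] \\[4pt]
  &= \sigma^2 + \text{Bias}^2 + \text{Variance}
\end{align*}
% Every cross term vanishes: E[\varepsilon] = 0 kills the first two,
% and E_D[\bar f(x) - \hat f_D(x)] = 0 by the definition of \bar f(x).
```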

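Read each term in plain words:

  - Bias²: how far the average prediction, taken over all the datasets you could have drawn, sits from the truth f(x).
  - Variance: how much the prediction at x swings from one dataset to the next around that average.
  - Noise: σ², the spread of y around f(x). No model can remove it.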

Note: this is an identity, not a law. It does not say bias and variance must trade off; it says total error splits into these parts. The tradeoff appears when one knob moves both.
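The identity can be checked numerically. The sketch below is a minimal Monte Carlo check under assumptions that are not in the text: a made-up true function sin(2πx), Gaussian noise with σ = 0.3, a cubic polynomial as the model, and a single query point x = 0.3. It retrains on many fresh datasets, estimates each term separately, and compares their sum with the directly measured squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Hypothetical true function (an assumption for this demo)."""
    return np.sin(2 * np.pi * x)

SIGMA = 0.3          # noise standard deviation, so sigma^2 = 0.09
N_TRAIN = 30         # points per training set
N_DATASETS = 4000    # number of fresh datasets D
DEGREE = 3           # model family: cubic polynomials
X0 = 0.3             # the fixed query point x

preds = np.empty(N_DATASETS)
sq_errs = np.empty(N_DATASETS)
for i in range(N_DATASETS):
    # Draw a fresh dataset D and fit the model to it.
    x = rng.uniform(0.0, 1.0, N_TRAIN)
    y = f(x) + rng.normal(0.0, SIGMA, N_TRAIN)
    coefs = np.polyfit(x, y, DEGREE)
    preds[i] = np.polyval(coefs, X0)
    # Draw a fresh noisy target at X0 and record the squared error.
    y0 = f(X0) + rng.normal(0.0, SIGMA)
    sq_errs[i] = (y0 - preds[i]) ** 2

bias_sq = (f(X0) - preds.mean()) ** 2   # (truth - average prediction)^2
variance = preds.var()                  # spread of predictions across datasets
noise = SIGMA ** 2                      # known by construction in this demo
total = sq_errs.mean()                  # left-hand side, measured directly

print(f"bias^2   = {bias_sq:.4f}")
print(f"variance = {variance:.4f}")
print(f"noise    = {noise:.4f}")
print(f"sum      = {bias_sq + variance + noise:.4f}")
print(f"measured = {total:.4f}")
```

The two last lines should agree up to Monte Carlo error, which shrinks as `N_DATASETS` grows. In real work you have one dataset and do not know σ², so these terms cannot be measured this way; the simulation only illustrates the bookkeeping.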

03 · The mechanism
The knob: model capacity

Simple model · fit a line to a curve
The average fit is wrong everywhere → high bias. But every dataset gives nearly the same line → low variance. It is confidently wrong, consistently.
Complex model · degree-20 polynomial
The average fit can match the truth → low bias. But each dataset's noise drags the wiggles around → high variance. Capable of being right, but unstable.

Turning capacity up lowers bias and raises variance. Total error is U-shaped in capacity; the sweet spot is where the marginal drop in bias² equals the marginal rise in variance.
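The effect of the capacity knob can be seen by estimating bias² and variance for several polynomial degrees. This is an illustrative sketch only: the true function sin(2πx), the noise level, the fixed training inputs, and the degrees 1, 3 and 9 are all assumptions chosen for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    """Hypothetical true function (an assumption for this demo)."""
    return np.sin(2 * np.pi * x)

SIGMA, N_TRAIN, N_DATASETS = 0.3, 30, 500
x_train = np.linspace(0.0, 1.0, N_TRAIN)   # fixed inputs; only the noise is redrawn
x_grid = np.linspace(0.05, 0.95, 50)       # points where bias and variance are measured

def bias_variance(degree):
    """Estimate average bias^2 and variance of a degree-`degree` polynomial fit."""
    preds = np.empty((N_DATASETS, x_grid.size))
    for i in range(N_DATASETS):
        y = f(x_train) + rng.normal(0.0, SIGMA, N_TRAIN)
        preds[i] = np.polyval(np.polyfit(x_train, y, degree), x_grid)
    bias_sq = np.mean((preds.mean(axis=0) - f(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

results = {d: bias_variance(d) for d in (1, 3, 9)}
for d, (b, v) in results.items():
    print(f"degree {d}: bias^2 = {b:.4f}, variance = {v:.4f}, sum = {b + v:.4f}")
```

In this setup bias² should shrink and variance should grow as the degree rises, so the sum is smallest at an intermediate degree. Which degree wins depends on the noise level and the amount of data.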

04 · Visualize it
Dartboard and U-curve

Dartboard. Truth is the bullseye; each dart is the model trained on one dataset. You only ever throw one dart (you have one dataset). Variance is the risk that your single dart lands far out even though the aim is true.

[Figure: dartboard panels labelled low variance / high variance and low bias / high bias.]

Each dot is the same model class retrained on a fresh dataset. Bias = where the cluster sits; variance = how spread it is.

The U-curve. Training error falls monotonically with capacity (for nested model families fitted to their optimum). Test error falls (bias shrinking), bottoms out, then rises (variance taking over). The gap between the two curves is the generalisation gap, which tracks variance only roughly.

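[Figure: error against model capacity, with a train-error curve and a test-error curve. The sweet spot marks the test-error minimum; underfitting (bias dominates) lies below it in capacity, overfitting (variance dominates) above it.]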

Train error always falls with capacity. Test error is U-shaped; the gap between curves is roughly variance.
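The two curves can be reproduced on a single dataset, which is the situation you are actually in. The sketch below is a toy example under assumed settings (true function sin(2πx), 15 evenly spaced training points, σ = 0.3, polynomial degrees 0 to 10), not a recipe for model selection.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    """Hypothetical true function (an assumption for this demo)."""
    return np.sin(2 * np.pi * x)

SIGMA = 0.3
x_train = np.linspace(0.0, 1.0, 15)
y_train = f(x_train) + rng.normal(0.0, SIGMA, x_train.size)
x_test = rng.uniform(0.0, 1.0, 2000)      # large held-out set standing in for "new points"
y_test = f(x_test) + rng.normal(0.0, SIGMA, x_test.size)

degrees = list(range(0, 11))              # capacity knob: polynomial degree
train_err, test_err = [], []
for d in degrees:
    coefs = np.polyfit(x_train, y_train, d)
    train_err.append(float(np.mean((np.polyval(coefs, x_train) - y_train) ** 2)))
    test_err.append(float(np.mean((np.polyval(coefs, x_test) - y_test) ** 2)))

best_degree = degrees[int(np.argmin(test_err))]
for d, tr, te in zip(degrees, train_err, test_err):
    marker = "  <- lowest test error" if d == best_degree else ""
    print(f"degree {d:2d}: train = {tr:.4f}, test = {te:.4f}{marker}")
```

With only one dataset the test curve is noisy, so the minimum can shift by a degree or two between seeds. The overall shape (train error falling, test error turning back up at high degree) is the part to look for.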

05 · Diagnosis
Failure first

Symptom | Problem | What helps
Train error high, test error high | Bias | Bigger model, better features, less regularisation. More data will not help.
Train error low, test error high | Variance | More data, regularisation, ensembling / bagging, early stopping, dropout.
Notice: almost every regularisation method is a variance-reduction device that deliberately buys a little bias.
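Ridge regression makes this trade visible. The sketch below is a toy setup and every setting in it is an assumption: true function sin(2πx), degree-9 polynomial features, fixed training inputs, and a penalty of 0.1 that also shrinks the intercept for simplicity. It compares an unregularised fit with a ridge fit on the same simulated datasets. Because both estimators are linear in y and the inputs are fixed, the average prediction equals the fit to noise-free targets, so bias² is computed exactly and only the variance is estimated by simulation.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    """Hypothetical true function (an assumption for this demo)."""
    return np.sin(2 * np.pi * x)

SIGMA, N_TRAIN, N_DATASETS, DEGREE = 0.3, 20, 500, 9
x_train = np.linspace(0.0, 1.0, N_TRAIN)
x_grid = np.linspace(0.0, 1.0, 50)

def features(x):
    # Monomials of t = 2x - 1 in [-1, 1]; the rescaling keeps the fit well conditioned.
    return np.vander(2.0 * x - 1.0, DEGREE + 1, increasing=True)

def ridge_fit(X, y, lam):
    """Ridge solution via an augmented least-squares system; lam = 0 gives plain least squares."""
    p = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
    y_aug = np.concatenate([y, np.zeros((p,) + y.shape[1:])])
    return np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]

X, X_grid = features(x_train), features(x_grid)
# One column per simulated dataset; both settings see the same noise draws.
Y = f(x_train)[:, None] + rng.normal(0.0, SIGMA, (N_TRAIN, N_DATASETS))

def bias_variance(lam):
    preds = X_grid @ ridge_fit(X, Y, lam)               # shape (50, N_DATASETS)
    mean_fit = X_grid @ ridge_fit(X, f(x_train), lam)   # exact average prediction
    bias_sq = np.mean((mean_fit - f(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=1))
    return bias_sq, variance

b_ols, v_ols = bias_variance(0.0)
b_ridge, v_ridge = bias_variance(0.1)
print(f"no penalty : bias^2 = {b_ols:.5f}, variance = {v_ols:.5f}")
print(f"ridge 0.1  : bias^2 = {b_ridge:.5f}, variance = {v_ridge:.5f}")
```

The penalised fit should show lower variance and higher bias². Whether the trade pays off in total error depends on the penalty strength, which is why it is usually tuned on held-out data.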

06 · Modern caveat
Where the textbook story breaks

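Deep networks trained past the interpolation point (where training error reaches zero) can see test error fall again, a pattern known as double descent. One proposed explanation is that heavily overparameterised models, with implicit regularisation from SGD, behave roughly like an average over many fits, which tames variance without a matching rise in bias.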

The decomposition still holds exactly; only the assumption "more capacity ⇒ more variance" breaks.
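A common toy stand-in for this regime, swapped in here in place of a deep network trained with SGD, is minimum-norm least squares on random features. The sketch below sweeps the number of random cosine features past the interpolation point (20 training points). Everything in it is an assumption for illustration: the true function, the feature distribution, and the feature counts. Whether a clear second descent appears depends on the seed, noise level, and feature scale, so treat the printed test errors as something to explore, not a guaranteed shape.

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    """Hypothetical true function (an assumption for this demo)."""
    return np.sin(2 * np.pi * x)

SIGMA, N_TRAIN, N_TEST = 0.3, 20, 1000
x_train = np.linspace(0.0, 1.0, N_TRAIN)
y_train = f(x_train) + rng.normal(0.0, SIGMA, N_TRAIN)
x_test = rng.uniform(0.0, 1.0, N_TEST)
y_test = f(x_test) + rng.normal(0.0, SIGMA, N_TEST)

MAX_FEATURES = 400
# Hypothetical random cosine features cos(w * x + b); scales chosen arbitrarily.
w = rng.normal(0.0, 10.0, MAX_FEATURES)
b = rng.uniform(0.0, 2.0 * np.pi, MAX_FEATURES)

def features(x, p):
    """First p random features evaluated at x, shape (len(x), p)."""
    return np.cos(np.outer(x, w[:p]) + b[:p])

errors = {}
for p in (2, 5, 10, 20, 40, 100, 400):
    Phi_train = features(x_train, p)
    # lstsq returns the minimum-norm solution once p exceeds the number of points.
    beta = np.linalg.lstsq(Phi_train, y_train, rcond=None)[0]
    train_mse = float(np.mean((Phi_train @ beta - y_train) ** 2))
    test_mse = float(np.mean((features(x_test, p) @ beta - y_test) ** 2))
    errors[p] = (train_mse, test_mse)
    print(f"features {p:3d}: train = {train_mse:.2e}, test = {test_mse:.4f}")
```

The part that is certain is the training side: once the feature count passes the number of training points, the fit interpolates the data and train error drops to numerical zero. The behaviour of test error on either side of that point is what the double-descent literature studies.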
