Bias–Variance Tradeoff

Where error comes from, and the knob that moves it

01 · First principles · Why does a trained model err?

You train a model on a dataset D and it errs on a new point x. Why? Before naming any tradeoff, ask what could possibly go wrong. Only three things can:

Your model family cannot represent the truth, so even the best member of the family is systematically off. (Bias)
Your model family can represent the truth, but the particular D you drew pulled the fit somewhere else. Draw a different D, get a noticeably different model. (Variance)
The target itself is noisy; even the true function would miss the observed value. (Irreducible noise)

That is the whole topic. Everything else is bookkeeping.

02 · The decomposition · The bookkeeping

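Fix a point x and write the target as y = f(x) + ε, where f is the true function and ε is zero-mean noise with variance σ². Imagine retraining your model on many fresh datasets D, getting a predictor f̂_D each time. For squared error, the expected error splits exactly: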

E_{D,ε}[(y − f̂_D(x))²] = (f(x) − E_D[f̂_D(x)])² + E_D[(f̂_D(x) − E_D[f̂_D(x)])²] + σ²

The three terms on the right are, in order: Bias², Variance, Noise.

Read each term in plain words:

Bias = how far the average model (averaged over all possible training sets) is from the truth. It survives infinite data of the same kind; it is the error of your assumptions.
Variance = how much an individual trained model wobbles around that average. It is the error of trusting one finite sample too much.
σ² = the floor. No model touches it.
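The three terms can be estimated numerically by doing literally what the text describes: retrain on many fresh datasets and look at the predictions at one fixed point. The sketch below is a minimal Monte Carlo illustration, not a reference implementation. The sine ground truth, the noise level, the cubic model family, and the query point are all assumptions picked for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Assumed ground-truth function (illustrative choice, not from the text)."""
    return np.sin(2 * np.pi * x)

SIGMA = 0.3      # assumed noise standard deviation; the floor is SIGMA**2
N_TRAIN = 30     # points in each training set D
N_TRIALS = 5000  # number of fresh datasets
DEGREE = 3       # model family: cubic polynomials
X0 = 0.25        # the fixed query point x

preds = np.empty(N_TRIALS)
sq_errs = np.empty(N_TRIALS)
for t in range(N_TRIALS):
    # Draw a fresh dataset D and fit the model to it.
    x = rng.uniform(0.0, 1.0, N_TRAIN)
    y = f(x) + rng.normal(0.0, SIGMA, N_TRAIN)
    coeffs = np.polyfit(x, y, DEGREE)
    preds[t] = np.polyval(coeffs, X0)
    # Draw a fresh noisy target at X0 and record the squared error.
    y0 = f(X0) + rng.normal(0.0, SIGMA)
    sq_errs[t] = (y0 - preds[t]) ** 2

bias_sq = (f(X0) - preds.mean()) ** 2   # (truth - average model)^2
variance = preds.var()                  # wobble around the average model
noise = SIGMA ** 2                      # irreducible floor
total = sq_errs.mean()                  # left-hand side, estimated directly

print(f"bias^2   = {bias_sq:.4f}")
print(f"variance = {variance:.4f}")
print(f"noise    = {noise:.4f}")
print(f"sum      = {bias_sq + variance + noise:.4f}")
print(f"total    = {total:.4f}")
```

The directly measured total and the sum of the three parts should agree up to Monte Carlo error, which shrinks as `N_TRIALS` grows.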

Note: this is an identity, not a law. It does not say bias and variance must trade off; it says total error splits into these parts. The tradeoff appears when one knob moves both.
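Why the split is exact can be seen in a few lines. The sketch below assumes y = f(x) + ε with E[ε] = 0, Var(ε) = σ², and ε independent of the training set D. The shorthand f̄(x) = E_D[f̂_D(x)] for the average model is introduced here only for brevity.

```latex
\begin{align*}
y - \hat f_D(x)
  &= \varepsilon + \bigl(f(x) - \bar f(x)\bigr) + \bigl(\bar f(x) - \hat f_D(x)\bigr) \\[4pt]
\mathbb{E}_{D,\varepsilon}\!\left[(y - \hat f_D(x))^2\right]
  &= \bigl(f(x) - \bar f(x)\bigr)^2
   + \mathbb{E}_D\!\left[(\hat f_D(x) - \bar f(x))^2\right]
   + \mathbb{E}[\varepsilon^2] \\
  &\quad + 2\bigl(f(x) - \bar f(x)\bigr)\,\mathbb{E}[\varepsilon]
   + 2\,\mathbb{E}[\varepsilon]\;\mathbb{E}_D\!\left[\bar f(x) - \hat f_D(x)\right]
   + 2\bigl(f(x) - \bar f(x)\bigr)\,\mathbb{E}_D\!\left[\bar f(x) - \hat f_D(x)\right] \\[4pt]
  &= \underbrace{\bigl(f(x) - \bar f(x)\bigr)^2}_{\text{Bias}^2}
   + \underbrace{\mathbb{E}_D\!\left[(\hat f_D(x) - \bar f(x))^2\right]}_{\text{Variance}}
   + \underbrace{\sigma^2}_{\text{Noise}}
\end{align*}
```

All three cross terms vanish: the first two because E[ε] = 0 (using independence of ε and D to factor the second), the third because E_D[f̄(x) − f̂_D(x)] = 0 by the definition of f̄. Nothing in the argument refers to capacity, which is why the identity says nothing by itself about a tradeoff.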

03 · The mechanism · The knob: model capacity

Simple model · fit a line to a curve

The average fit is wrong everywhere → high bias. But every dataset gives nearly the same line → low variance. It is confidently wrong, consistently.

Complex model · degree-20 polynomial

The average fit can match the truth → low bias. But each dataset's noise drags the wiggles around → high variance. Capable of being right, but unstable.

Turning capacity up lowers bias and raises variance. Total error is U-shaped in capacity; the sweet spot is where the marginal drop in bias² equals the marginal rise in variance.
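The effect of the knob can be sketched by sweeping polynomial degree and estimating both terms at each setting. This is an illustration under assumptions, not a general result: the sine ground truth, the noise level, the fixed training inputs, and the three degrees are all choices made for the demo, and the helper name `bias_variance` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    """Assumed ground-truth curve (illustrative choice)."""
    return np.sin(2 * np.pi * x)

SIGMA = 0.3                                # assumed noise standard deviation
N_TRAIN = 30
N_TRIALS = 500                             # fresh datasets per degree
X_TRAIN = np.linspace(0.0, 1.0, N_TRAIN)   # fixed inputs; only the noise is redrawn
X_GRID = np.linspace(0.05, 0.95, 50)       # query points to average over

def bias_variance(degree):
    """Estimate grid-averaged bias^2 and variance for one polynomial degree."""
    preds = np.empty((N_TRIALS, X_GRID.size))
    for t in range(N_TRIALS):
        y = f(X_TRAIN) + rng.normal(0.0, SIGMA, N_TRAIN)
        coeffs = np.polyfit(X_TRAIN, y, degree)
        preds[t] = np.polyval(coeffs, X_GRID)
    avg_model = preds.mean(axis=0)
    bias_sq = np.mean((avg_model - f(X_GRID)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

results = {d: bias_variance(d) for d in (1, 3, 9)}
for d, (b, v) in results.items():
    print(f"degree {d}: bias^2 = {b:.4f}, variance = {v:.4f}, sum = {b + v:.4f}")
```

In this setup the line (degree 1) carries most of its error as bias, while degree 9 carries almost all of it as variance; the intermediate degree is where the two are closest to balanced.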

04 · Visualize it · Dartboard and U-curve

Dartboard. Truth is the bullseye; each dart is the model trained on one dataset. You only ever throw one dart (you have one dataset). Variance is the risk that your single dart lands far out even though the aim is true.

Each dot is the same model class retrained on a fresh dataset. Bias = where the cluster sits; variance = how spread it is.

The U-curve. Training error falls monotonically with capacity. Test error falls (bias shrinking), bottoms out, then rises (variance taking over). The gap between the two curves is variance, roughly.

Train error always falls with capacity. Test error is U-shaped; the gap between curves is roughly variance.
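The two curves can be reproduced on a single dataset, which is the situation you are actually in. The sketch below is a hedged illustration: the sine ground truth, the noise level, and the sample sizes are assumptions, and `Chebyshev.fit` is used only because it is a numerically stable way to do least-squares polynomial fitting in NumPy.

```python
import numpy as np
from numpy.polynomial import Chebyshev

rng = np.random.default_rng(2)

def f(x):
    """Assumed ground-truth curve (illustrative choice)."""
    return np.sin(2 * np.pi * x)

SIGMA = 0.3                 # assumed noise standard deviation
N_TRAIN, N_TEST = 20, 2000

# One training set (all you ever get) and a large held-out set.
x_train = rng.uniform(0.0, 1.0, N_TRAIN)
y_train = f(x_train) + rng.normal(0.0, SIGMA, N_TRAIN)
x_test = rng.uniform(0.0, 1.0, N_TEST)
y_test = f(x_test) + rng.normal(0.0, SIGMA, N_TEST)

degrees = list(range(0, 11))
train_mse, test_mse = [], []
for d in degrees:
    # Least-squares polynomial of degree d, expressed in a Chebyshev basis.
    model = Chebyshev.fit(x_train, y_train, d, domain=[0.0, 1.0])
    train_mse.append(float(np.mean((model(x_train) - y_train) ** 2)))
    test_mse.append(float(np.mean((model(x_test) - y_test) ** 2)))

best_degree = degrees[int(np.argmin(test_mse))]
for d, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree {d:2d}: train = {tr:.4f}, test = {te:.4f}, gap = {te - tr:.4f}")
print("lowest test error at degree", best_degree)
```

Because each degree's model family contains the previous one, training error cannot go up as the degree increases. Where the test curve bottoms out depends on the particular sample and noise level.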

05 · Diagnosis · Failure first

Symptom	Problem	What helps
Train error high, test error high	Bias	Bigger model, better features, less regularisation. More data will not help.
Train error low, test error high	Variance	More data, regularisation, ensembling / bagging, early stopping, dropout.
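The table can be read as a small decision rule. The helper below is a hypothetical sketch: the function name `diagnose` and the `acceptable_err` threshold are assumptions introduced here, and what counts as "high" has to come from your own task (for example a baseline or an estimate of the noise floor).

```python
def diagnose(train_err: float, test_err: float, acceptable_err: float) -> str:
    """Map the symptom table onto a label.

    acceptable_err is an assumed, task-specific threshold: any error above
    it counts as "high". It is not defined in the table itself.
    """
    if test_err <= acceptable_err:
        return "ok"            # test error is fine; nothing to fix
    if train_err > acceptable_err:
        return "bias"          # train high, test high
    return "variance"          # train low, test high

print(diagnose(train_err=0.30, test_err=0.32, acceptable_err=0.10))
print(diagnose(train_err=0.02, test_err=0.30, acceptable_err=0.10))
```

A model can suffer from both problems at once (high training error and a large gap); this rule reports only the first row that matches.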

Notice: almost every regularisation method is a variance-reduction device that deliberately buys a little bias.
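Ridge regression makes this trade visible in a few lines. The sketch below compares ordinary least squares with a ridge fit on the same noise draws, using the closed-form solution. Everything concrete in it is an assumption for illustration: the linear ground truth `beta`, the fixed design `X`, the noise level, the penalty `lam = 1.0`, and the query point `x0` (chosen along `beta` so that the shrinkage bias is easy to see).

```python
import numpy as np

rng = np.random.default_rng(3)

N, D = 20, 5
SIGMA = 3.0        # assumed noise standard deviation
N_TRIALS = 4000    # number of fresh datasets

X = rng.normal(size=(N, D))                      # fixed inputs, reused across datasets
beta = np.array([1.0, -2.0, 0.5, 3.0, -1.0])     # assumed true coefficients
x0 = beta.copy()                                 # illustrative query point
truth = x0 @ beta

# Each row of Y is the target vector of one fresh dataset.
noise = rng.normal(0.0, SIGMA, size=(N_TRIALS, N))
Y = X @ beta + noise

def bias_variance(lam):
    """Bias^2 and variance of the prediction at x0; lam = 0 gives plain OLS."""
    A = X.T @ X + lam * np.eye(D)
    W = np.linalg.solve(A, X.T @ Y.T)    # one coefficient column per dataset
    preds = x0 @ W
    return (preds.mean() - truth) ** 2, preds.var()

ols_bias_sq, ols_var = bias_variance(0.0)
ridge_bias_sq, ridge_var = bias_variance(1.0)

print(f"OLS:   bias^2 = {ols_bias_sq:.4f}, variance = {ols_var:.4f}")
print(f"ridge: bias^2 = {ridge_bias_sq:.4f}, variance = {ridge_var:.4f}")
```

The penalty shrinks the coefficients toward zero, so predictions wobble less across datasets but their average moves away from the truth. Whether the total goes down depends on the penalty strength and the noise level.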

06 · Modern caveat · Where the textbook story breaks

Past the interpolation point, deep networks can see test error fall again (double descent). One common explanation is that heavily overparameterised models, helped by implicit regularisation from SGD, behave like an average over many fits, which tames variance without paying much bias.

The decomposition still holds exactly; only the assumption "more capacity ⇒ more variance" breaks.

Mental Model

Expected error = (error of your assumptions)² + (error of trusting one sample) + noise.
Capacity is a knob that pours error from the first bucket into the second.
Diagnose by the train/test gap: no gap and both high → fix assumptions; big gap → fix trust (regularise, get data, average).
The decomposition is always true; the tradeoff is only usually true.