Ensembles

Many imperfect models, one good answer — if they disagree usefully

01 · First principles: Why should averaging models work at all?

Ask a thousand people to guess the weight of an ox and the average guess is eerily accurate — provided their errors point in different directions, so that averaging cancels them. The same logic applies to models: each trained model is truth plus an error term, and if the error terms are not all the same error, combining models shrinks the error that remains.

That proviso, that the errors are not all the same error, is the entire subject. A committee of clones is just one model with extra compute. Ensembling works exactly to the degree that the members are individually decent and mutually diverse: competent enough to be right on average, different enough to be wrong in different places.
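The ox-guessing argument can be checked with a small simulation. This is a minimal sketch under stated assumptions: the true weight, the Gaussian error model, and the function name `crowd_error` are all hypothetical choices made for illustration. Each guess is the truth plus a shared error (identical for everyone in a crowd) plus a private error (independent per guesser).

```python
import random
import statistics


def crowd_error(n_guessers, shared_sd, private_sd, trials=1000, seed=0):
    """RMS error of the crowd's mean guess, estimated over many simulated crowds."""
    rng = random.Random(seed)
    truth = 1200.0  # hypothetical ox weight, arbitrary units
    squared_errors = []
    for _ in range(trials):
        # One error term that every guesser in this crowd shares.
        shared = rng.gauss(0.0, shared_sd)
        # Plus an independent error term per guesser.
        guesses = [truth + shared + rng.gauss(0.0, private_sd)
                   for _ in range(n_guessers)]
        squared_errors.append((statistics.fmean(guesses) - truth) ** 2)
    return statistics.fmean(squared_errors) ** 0.5


# Same per-guesser error size (sd = 100), opposite correlation structure.
diverse = crowd_error(n_guessers=100, shared_sd=0.0, private_sd=100.0)
clones = crowd_error(n_guessers=100, shared_sd=100.0, private_sd=0.0)
print(f"independent errors: RMS error of the mean = {diverse:.1f}")
print(f"identical errors:   RMS error of the mean = {clones:.1f}")
```

With independent errors the mean of 100 guesses should be roughly ten times more accurate than a single guess; with fully shared errors the crowd is no better than one member.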

02 · The equation: Variance of the mean, with correlation

Make it exact. Take n models whose predictions at a point each have variance σ² and pairwise correlation ρ. Both are taken over redraws of the training data, so σ² is the variance term of the bias–variance decomposition. The variance of their average is:

Var( (1/n) Σ f̂ᵢ )  =  ρσ²  +  (1 − ρ)σ² / n

ρσ²: the correlated part, which averaging cannot touch
(1 − ρ)σ² / n: the independent part, which dies as 1/n

Read it slowly, because this one line is the whole field. The independent share of the error vanishes as you add members. The correlated share survives no matter how many models you train: with ρ = 1 the ensemble equals one model; with ρ = 0 variance falls all the way to σ²/n. Every ensemble method ever invented is a scheme for pushing ρ down without pushing individual error (σ², and bias) up too much.
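For readers who want the missing step, the formula follows from the variance of a sum. The sketch below assumes, as the text does, that every member has the same variance σ² and every distinct pair the same correlation ρ, so that Cov(f̂ᵢ, f̂ⱼ) = ρσ² for i ≠ j:

```latex
\begin{align*}
\operatorname{Var}\!\Big(\frac{1}{n}\sum_{i=1}^{n}\hat f_i\Big)
  &= \frac{1}{n^2}\Big[\sum_{i=1}^{n}\operatorname{Var}(\hat f_i)
     + \sum_{i \neq j}\operatorname{Cov}(\hat f_i,\hat f_j)\Big] \\
  &= \frac{1}{n^2}\Big[n\sigma^2 + n(n-1)\,\rho\sigma^2\Big] \\
  &= \frac{\sigma^2}{n} + \frac{n-1}{n}\,\rho\sigma^2 \\
  &= \rho\sigma^2 + \frac{(1-\rho)\,\sigma^2}{n}.
\end{align*}
```

The n diagonal terms contribute σ² each and the n(n − 1) off-diagonal terms contribute ρσ² each; regrouping by ρ gives the two-part form above.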

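Consequence: the gap above the floor is (1 − ρ)σ²/n, so once n is large compared with (1 − ρ)/ρ, adding more identical-recipe models does almost nothing; the ρσ² floor has effectively been reached. With ρ = 0.5, ten members are already only 10% above it. To improve further you must diversify differently, not multiply harder.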
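How quickly the floor is reached can be read straight off the formula. The sketch below is illustrative only: the function names `ensemble_variance` and `members_for_tolerance` and the 5% tolerance are assumptions made for this example, not part of any library.

```python
import math


def ensemble_variance(n, rho, sigma2=1.0):
    """Variance of the mean of n predictors with variance sigma2 and pairwise correlation rho."""
    return sigma2 * (rho + (1.0 - rho) / n)


def members_for_tolerance(rho, tol=0.05):
    """An ensemble size whose variance is at most a fraction tol above the floor rho * sigma2.

    Solves (1 - rho) / n <= tol * rho for n, so it requires rho > 0.
    Floating-point rounding may occasionally return one member more than strictly needed.
    """
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must be in (0, 1]")
    return max(1, math.ceil((1.0 - rho) / (tol * rho)))


for rho in (0.9, 0.5, 0.1):
    n = members_for_tolerance(rho)
    print(f"rho={rho}: n={n}, variance={ensemble_variance(n, rho):.4f}, floor={rho:.4f}")
```

The required n grows as ρ shrinks, which is another way of saying that large ensembles only pay off when the members are already weakly correlated.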

03 · Sources of diversity: Where low ρ comes from

| Lever | Mechanism | Canonical method |
| --- | --- | --- |
| Resample the data | Each model sees a different bootstrap draw, so each overfits different noise. | Bagging |
| Subsample the features | Models are forbidden from all leaning on the same dominant feature. | Random forests (bagging + per-split feature subsets) |
| Change the objective per member | Each model is trained on what the previous ones still get wrong: diversity by construction, aimed at bias rather than variance. | Boosting |
| Change the algorithm | A tree, a linear model, and a neural net have different inductive biases, hence decorrelated errors. | Heterogeneous ensembles, stacking |
| Change the randomness | Different seeds, init, augmentation, or checkpoints of one training run. | Deep ensembles, snapshot ensembles |

[Figure: ensemble variance versus number of models n. ρ = 0.9 → floor at 0.9 σ²; ρ = 0.5 → floor at 0.5 σ²; ρ = 0 → variance → 0.]

σ²(ρ + (1−ρ)/n) versus n. The curve you are on is set by ρ; n only walks you down to its floor.
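The first two levers, resampling rows and subsampling features, are simple to sketch. The code below is a minimal illustration using only the standard library; the helper names `bootstrap_indices` and `feature_subset` are hypothetical, and no model is actually trained. It shows how different the training sets of two bagged members are.

```python
import random


def bootstrap_indices(n_rows, rng):
    """One bootstrap draw: n_rows row indices sampled with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def feature_subset(n_features, k, rng):
    """A random subset of k feature indices, as a random forest draws at each split."""
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)
n_rows = 1000
draws = [bootstrap_indices(n_rows, rng) for _ in range(5)]

# Fraction of distinct rows each member sees; for large n_rows this is
# close to 1 - 1/e, about 0.632, because some rows are drawn repeatedly.
unique_fractions = [len(set(d)) / n_rows for d in draws]

# Jaccard overlap between the row sets of the first two members.
a, b = set(draws[0]), set(draws[1])
overlap = len(a & b) / len(a | b)

print("distinct-row fractions:", [round(f, 3) for f in unique_fractions])
print("row overlap between members 0 and 1:", round(overlap, 3))
print("feature subset for one split:", feature_subset(n_features=10, k=3, rng=rng))
```

Because each member misses roughly a third of the rows, and a different third each time, the members overfit different noise, which is where the lower ρ comes from.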

04 · Combining: Vote, average, or learn the combination

For regression, average. For classification, vote on labels or (better) average the predicted probabilities, which keeps each member's confidence rather than collapsing it to a hard label. Stacking goes one step further: treat member predictions as features and train a small meta-model, typically regularised logistic regression, to combine them. The one rule that matters: the meta-model must be trained on out-of-fold predictions, otherwise it learns to trust whichever member overfit hardest, and the stack overfits at the second level.
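The out-of-fold rule is easiest to appreciate with a member that memorises its training set. The sketch below is an assumption-laden illustration, not a reference implementation: the helper `out_of_fold_predictions`, the 1-nearest-neighbour member, and the synthetic data are all hypothetical. In-sample, the memoriser looks perfect; out-of-fold, its real error shows.

```python
import random


def out_of_fold_predictions(fit, X, y, k=5, seed=0):
    """One prediction per row, each made by a model that never saw that row.

    `fit(X_train, y_train)` must return a callable `predict(x) -> float`.
    """
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k disjoint folds covering every row
    oof = [None] * len(X)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in order if i not in held]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            oof[i] = predict(X[i])
    return oof


def fit_one_nearest_neighbour(X_train, y_train):
    """A deliberately overfitting member: returns the label of the closest training row."""
    def predict(x):
        j = min(range(len(X_train)), key=lambda i: abs(X_train[i] - x))
        return y_train[j]
    return predict


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


rng = random.Random(1)
X = [rng.uniform(0.0, 1.0) for _ in range(200)]
y = [x + rng.gauss(0.0, 0.3) for x in X]  # synthetic data: linear signal plus noise

memoriser = fit_one_nearest_neighbour(X, y)
in_sample = [memoriser(x) for x in X]
oof = out_of_fold_predictions(fit_one_nearest_neighbour, X, y, k=5)

print(f"in-sample MSE:   {mse(in_sample, y):.3f}")
print(f"out-of-fold MSE: {mse(oof, y):.3f}")
```

A meta-model fed the in-sample column would conclude that this member is flawless and weight it heavily. Fed the out-of-fold column, one per member, it sees each member's honest error instead.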

The two great families are worth keeping mentally orthogonal: bagging combines deep, low-bias, high-variance learners in parallel to cancel variance; boosting combines shallow, high-bias learners in sequence to grind down bias. Same word "ensemble," opposite diagnoses.
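The sequential half of that contrast can be sketched in a few lines. The code below is a toy version of gradient boosting for squared loss on one-dimensional inputs, written with the standard library only; the function names, the learning rate, and the synthetic sine data are assumptions made for illustration. Each stump is a shallow, high-bias learner, and each round fits what the running prediction still gets wrong.

```python
import math
import random


def fit_stump(x, r):
    """Least-squares regression stump on 1-D inputs: one threshold, two leaf means."""
    best = None
    for t in sorted(set(x))[:-1]:  # dropping the largest value keeps both sides non-empty
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm


def boost(x, y, rounds=30, lr=0.5):
    """Each round fits a stump to the current residuals and adds a shrunken copy of it."""
    stumps, pred, history = [], [0.0] * len(x), []
    for _ in range(rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for xi, pi in zip(x, pred)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))
    return stumps, history


rng = random.Random(0)
x = [rng.uniform(-3.0, 3.0) for _ in range(100)]
y = [math.sin(xi) + rng.gauss(0.0, 0.1) for xi in x]  # synthetic target

stumps, train_mse = boost(x, y)
print(f"training MSE after round 1:  {train_mse[0]:.4f}")
print(f"training MSE after round 30: {train_mse[-1]:.4f}")
```

No single stump can represent a sine wave, yet the training error keeps falling as stumps accumulate: bias is being ground down in sequence. Bagging would instead train many deep models independently and average them to cancel variance.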

05 · The bill: What ensembling costs
