Bagging

Manufacture many datasets, average away the wobble

01 · First principles
The problem: one dataset, one wobble

Some learners are good on average but unstable: train a fully grown decision tree on your data, then on a slightly perturbed copy, and you get two very different trees. In bias–variance language they have low bias and high variance — the average over hypothetical retrainings is accurate, but you only ever hold one draw from that distribution.

If we could train on many independent datasets and average the resulting models, the wobble would cancel and the accurate average would emerge. The problem forcing bagging to exist: we have exactly one dataset.

02 · The trick
Bootstrap: pseudo-datasets from one dataset

The bootstrap manufactures the missing datasets. Resample n points from your n-point dataset with replacement: some points appear two or three times, others not at all. Each resample is a plausible alternative dataset drawn from approximately the same distribution. Bootstrap aggregating, "bagging" for short, then takes three steps:

1. Draw B bootstrap samples D₁ … D_B from the training set.
2. Train one high-variance model f̂_b on each (deep trees, no pruning: keep the bias low and let the variance run).
3. Predict by averaging (regression) or by majority vote / averaged probabilities (classification).

f̂_bag(x) = (1/B) Σ_{b=1..B} f̂_b(x)

Each point is omitted from a given resample with probability (1 − 1/n)ⁿ ≈ e⁻¹ ≈ 0.37.
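The 0.37 figure is easy to check numerically. The sketch below is an illustration only: the function names are invented here, and the sample sizes are arbitrary. It evaluates (1 − 1/n)ⁿ for a few n and compares it with the share of points actually missing from one simulated resample.

```python
import math
import random


def omit_probability(n: int) -> float:
    """Chance that one fixed point is absent from an n-out-of-n bootstrap resample."""
    return (1 - 1 / n) ** n


def empirical_omitted_fraction(n: int, rng: random.Random) -> float:
    """Draw one bootstrap resample of indices and report the share never drawn."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1 - len(drawn) / n


rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (10, 100, 10_000):
    exact = omit_probability(n)
    observed = empirical_omitted_fraction(n, rng)
    print(f"n={n:>6}  formula={exact:.4f}  one simulated resample={observed:.4f}")
print(f"limit e^-1 = {math.exp(-1):.4f}")
```

The formula column approaches e⁻¹ from below as n grows; the simulated column fluctuates around it, less so for large n.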

Each bootstrap model overfits its own resample's noise; the noises differ, so the average keeps the signal.
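The three steps can be sketched in plain Python. Everything specific here is an assumption made for illustration: a single input feature, a sine-shaped target with Gaussian noise, B = 50, and the helper names `fit_tree`, `fit_bagged` and so on. The tree is a deliberately naive unpruned regression tree, not a production implementation.

```python
import math
import random
from statistics import fmean


def fit_tree(points):
    """Grow an unpruned regression tree on (x, y) pairs with one feature.

    Returns a float (leaf value) or a tuple (threshold, left, right).
    """
    xs = sorted({x for x, _ in points})
    if len(xs) == 1:                       # nothing left to split on
        return fmean(y for _, y in points)
    best_sse, best_t = math.inf, None
    for lo, hi in zip(xs, xs[1:]):         # candidate thresholds: midpoints
        t = (lo + hi) / 2
        sse = 0.0
        for side in ([y for x, y in points if x <= t],
                     [y for x, y in points if x > t]):
            m = fmean(side)
            sse += sum((y - m) ** 2 for y in side)
        if sse < best_sse:                 # greedy: keep the lowest squared error
            best_sse, best_t = sse, t
    left = [p for p in points if p[0] <= best_t]
    right = [p for p in points if p[0] > best_t]
    return (best_t, fit_tree(left), fit_tree(right))


def predict(tree, x):
    """Walk from the root to a leaf and return its value."""
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x <= t else right
    return tree


def fit_bagged(points, B, rng):
    """Steps 1 and 2: draw B bootstrap resamples, fit one tree on each."""
    n = len(points)
    return [fit_tree([points[rng.randrange(n)] for _ in range(n)])
            for _ in range(B)]


def predict_bagged(trees, x):
    """Step 3 (regression): average the member predictions."""
    return fmean(predict(t, x) for t in trees)


def truth(x):
    return math.sin(2 * math.pi * x)       # assumed noise-free target


rng = random.Random(0)
xs = [rng.random() for _ in range(60)]
train = [(x, truth(x) + rng.gauss(0, 0.3)) for x in xs]
grid = [i / 200 for i in range(1, 200)]

single = fit_tree(train)
bagged = fit_bagged(train, B=50, rng=rng)
mse_single = fmean((predict(single, x) - truth(x)) ** 2 for x in grid)
mse_bagged = fmean((predict_bagged(bagged, x) - truth(x)) ** 2 for x in grid)
print(f"single tree MSE: {mse_single:.3f}   bagged (B=50) MSE: {mse_bagged:.3f}")
```

On this toy problem the single tree memorises each noisy target, while the bagged average blends several nearby targets, so its error against the noise-free curve should come out clearly lower.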

03 · What it buys
Pure variance reduction, bias untouched

Averaging does not move the centre of the distribution it averages: E[f̂_bag] is essentially E[f̂] (bootstrap draws are a slightly degraded stand-in for fresh data, but to first order the bias is unchanged). What averaging does move is the spread. By the variance-of-the-mean identity (derived properly in ensembles):

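Var(f̂_bag) = ρσ² + (1 − ρ)σ²/B

Here σ² is the variance of a single model's prediction and ρ is the average pairwise correlation between the models' predictions. The first term, ρσ², is the shared overfitting, and it survives averaging. The second term, (1 − ρ)σ²/B, is the independent overfitting, and it dies as B grows.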
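A quick simulation can sanity-check the identity. The sketch below assumes the simplest setting in which it holds exactly: B model errors that all have variance σ² and share one common pairwise correlation ρ, built from a shared Gaussian component plus an independent one. The function names and the values ρ = 0.3, σ² = 1 are illustrative choices.

```python
import random
from statistics import fmean, pvariance


def formula(rho: float, sigma2: float, B: int) -> float:
    """Variance of the average of B equicorrelated estimators."""
    return rho * sigma2 + (1 - rho) * sigma2 / B


def simulate(rho: float, sigma2: float, B: int, trials: int, rng: random.Random) -> float:
    """Empirical variance of the average, over many simulated ensembles."""
    sd = sigma2 ** 0.5
    averages = []
    for _ in range(trials):
        shared = rng.gauss(0, 1)           # overfitting common to every member
        members = [
            sd * (rho ** 0.5 * shared + (1 - rho) ** 0.5 * rng.gauss(0, 1))
            for _ in range(B)              # plus each member's own noise
        ]
        averages.append(fmean(members))
    return pvariance(averages)


rng = random.Random(0)
for B in (1, 10, 100):
    print(f"B={B:>3}  formula={formula(0.3, 1.0, B):.3f}  "
          f"simulated={simulate(0.3, 1.0, B, 5000, rng):.3f}")
```

The formula column reads 1.000, 0.370, 0.307: the independent part shrinks tenfold at each step while the 0.3 floor stays put. The simulated column should agree up to sampling noise.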

Two consequences fall out. First, bagging only helps unstable learners: bagging a linear regression (low σ², high ρ across resamples) achieves nearly nothing, while bagging deep trees is transformative. Second, the choice of base learner is deliberate: use low-bias, high-variance members, because bagging fixes only the second disease. Boosting is the mirror-image bet for the first.

Free lunch, small but real: each model never saw ~37% of the data — its out-of-bag points. Score every training point using only the models that did not train on it and you get an honest generalisation estimate with no held-out set and no extra training. In practice OOB error tracks cross-validation closely.
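A minimal sketch of out-of-bag scoring, under the same kind of toy assumptions as elsewhere: one feature, a sine target with Gaussian noise, a naive unpruned tree, and invented helper names. For each training point it averages only the trees whose resample missed that point, then compares the resulting error with the error on the training set itself and on fresh data.

```python
import math
import random
from statistics import fmean


def fit_tree(points):
    """Unpruned one-feature regression tree: float leaf or (threshold, left, right)."""
    xs = sorted({x for x, _ in points})
    if len(xs) == 1:
        return fmean(y for _, y in points)
    best_sse, best_t = math.inf, None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        sse = 0.0
        for side in ([y for x, y in points if x <= t],
                     [y for x, y in points if x > t]):
            m = fmean(side)
            sse += sum((y - m) ** 2 for y in side)
        if sse < best_sse:
            best_sse, best_t = sse, t
    return (best_t,
            fit_tree([p for p in points if p[0] <= best_t]),
            fit_tree([p for p in points if p[0] > best_t]))


def predict(tree, x):
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x <= t else right
    return tree


def bag_with_oob(points, B, rng):
    """Fit B trees and remember which training indices each one saw."""
    n = len(points)
    trees, in_bag = [], []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_tree([points[i] for i in idx]))
        in_bag.append(set(idx))
    return trees, in_bag


def oob_mse(points, trees, in_bag):
    """Score each point using only the trees that never trained on it."""
    sq_errors = []
    for i, (x, y) in enumerate(points):
        votes = [predict(t, x) for t, seen in zip(trees, in_bag) if i not in seen]
        if votes:                          # skip a point that every tree saw
            sq_errors.append((y - fmean(votes)) ** 2)
    return fmean(sq_errors)


rng = random.Random(1)


def sample(n):
    xs = [rng.random() for _ in range(n)]
    return [(x, math.sin(2 * math.pi * x) + rng.gauss(0, 0.3)) for x in xs]


train, fresh = sample(80), sample(500)
trees, in_bag = bag_with_oob(train, B=100, rng=rng)


def ensemble(x):
    return fmean(predict(t, x) for t in trees)


train_mse = fmean((y - ensemble(x)) ** 2 for x, y in train)
fresh_mse = fmean((y - ensemble(x)) ** 2 for x, y in fresh)
oob = oob_mse(train, trees, in_bag)
print(f"training MSE {train_mse:.3f} | OOB MSE {oob:.3f} | fresh-data MSE {fresh_mse:.3f}")
```

The training error is optimistic because most trees have memorised each training target. The OOB figure should land much nearer the fresh-data error, at the cost of nothing more than bookkeeping.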

04 · The upgrade
Random forests: decorrelate the trees

Bagged trees share a defect: they are all built from the same features by the same greedy criterion. If one feature is strongly predictive, every tree splits on it at the root, and the trees come out similar — ρ stays high, and the ρσ² term puts a floor under the ensemble that more trees cannot break.

Random forests attack ρ directly: at each split, the tree may only consider a random subset of features (√d is the classification default). The dominant feature is unavailable for many splits, so different trees are forced to discover different structure, and their errors decorrelate. Each individual tree gets slightly worse (σ² up a little, bias up a little); the ensemble gets better, because in σ²(ρ + (1−ρ)/B) the drop in ρ outweighs both.

Method	Diversity source	What falls
Single deep tree	—	nothing (high variance)
Bagging	bootstrap resampling	(1−ρ)σ²/B term
Random forest	resampling + per-split feature subsets	the ρσ² floor itself
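The trade-off in the table can be made concrete with arithmetic on σ²(ρ + (1 − ρ)/B). The numbers below are invented for illustration, not measured: bagged trees with ρ = 0.6 and σ² = 1.0, and forest trees that are individually noisier (σ² = 1.2) but far less correlated (ρ = 0.2). The sketch covers the variance term only; the small bias increase is not modelled.

```python
def ensemble_variance(rho: float, sigma2: float, B: int) -> float:
    """Variance of an average of B members: sigma2 * (rho + (1 - rho) / B)."""
    return sigma2 * (rho + (1 - rho) / B)


# Hypothetical values, chosen only to illustrate the trade-off.
bagging = {"rho": 0.6, "sigma2": 1.0}
forest = {"rho": 0.2, "sigma2": 1.2}

for B in (1, 10, 100, 1000):
    v_bag = ensemble_variance(B=B, **bagging)
    v_rf = ensemble_variance(B=B, **forest)
    print(f"B={B:>4}  bagging={v_bag:.4f}  random forest={v_rf:.4f}")

# Floors as B grows without bound: rho * sigma2
print("floors:", bagging["rho"] * bagging["sigma2"], forest["rho"] * forest["sigma2"])
```

With these numbers a lone forest tree is worse (1.2 against 1.0), but by B = 10 the forest is already ahead (0.336 against 0.64), and its floor of 0.24 sits well under the bagging floor of 0.6.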

05 · In practice
Knobs and habits

More trees never hurt accuracy — variance only falls with B — they just cost compute. Use enough that OOB error has flattened (hundreds, typically).
The feature-subset size is the main hyperparameter: smaller → less correlation but weaker trees. The defaults (√d, or d/3 for regression) are rarely worth fighting.
Trees stay deep and unpruned — bagging wants low-bias members and will handle the variance itself.
Random forests remain the strongest "no tuning, no scaling, works Monday morning" baseline for tabular data; gradient boosting usually edges them out after tuning.
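As a sketch of how these knobs look in code, the snippet below assumes scikit-learn is installed and uses a synthetic dataset from `make_classification`. The dataset sizes and the helper name `oob_accuracy` are arbitrary choices. Setting `max_features=None` lets every split see all features, which turns the forest into plain bagged trees, so the two columns compare bagging with a random forest at the same B.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 25 features, only 5 of them informative (an arbitrary choice).
X, y = make_classification(
    n_samples=600, n_features=25, n_informative=5, random_state=0
)


def oob_accuracy(n_trees, max_features):
    """Fit a forest of unpruned trees and return its out-of-bag accuracy."""
    forest = RandomForestClassifier(
        n_estimators=n_trees,        # B: grow until the OOB score flattens
        max_features=max_features,   # "sqrt" = random forest, None = bagged trees
        oob_score=True,              # score each point with trees that missed it
        random_state=0,
    )
    forest.fit(X, y)
    return forest.oob_score_


for n_trees in (50, 200, 400):
    rf = oob_accuracy(n_trees, "sqrt")
    bag = oob_accuracy(n_trees, None)
    print(f"B={n_trees:>3}  random forest OOB={rf:.3f}  bagged trees OOB={bag:.3f}")
```

Library defaults vary: scikit-learn's classifier uses √d per split, but its `RandomForestRegressor` considers all features unless `max_features` is set, so the d/3 rule has to be requested explicitly there.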

Mental Model

Bagging simulates "many training sets" with bootstrap resampling, trains one unstable model per resample, and averages.
It is pure variance reduction: the bias of the base learner is untouched, so use deep low-bias trees.
Each model misses ~37% of the data — out-of-bag points give a free, honest validation score.
Plain bagged trees stay correlated; random forests add per-split feature subsets to push ρ down, which is where the real gain lives.
Opposite of boosting: parallel, variance-killing, hard to overfit by adding members.