Manufacture many datasets, average away the wobble
Some learners are good on average but unstable: train a fully grown decision tree on your data, then on a slightly perturbed copy, and you get two very different trees. In bias–variance language they have low bias and high variance — the average over hypothetical retrainings is accurate, but you only ever hold one draw from that distribution.
If we could train on many independent datasets and average the resulting models, the wobble would cancel and the accurate average would emerge. The problem forcing bagging to exist: we have exactly one dataset.
The bootstrap manufactures the missing datasets. Resample n points from your n-point dataset with replacement: some points appear twice or thrice, others not at all. Each resample is a plausible alternative dataset drawn from approximately the same distribution. Bootstrap aggregating:
Each bootstrap model overfits its own resample's noise; the noises differ, so the average keeps the signal.
Averaging does not move the centre of the distribution it averages: E[f̂bag] is essentially E[f̂] (bootstrap draws are a slightly degraded stand-in for fresh data, but to first order the bias is unchanged). What averaging does move is the spread. By the variance-of-the-mean identity (derived properly in ensembles):
Two consequences fall out. First, bagging only helps unstable learners: bagging a linear regression (low σ², high ρ across resamples) achieves nearly nothing, while bagging deep trees is transformative. Second, the choice of base learner is deliberate: use low-bias, high-variance members, because bagging fixes only the second disease. Boosting is the mirror-image bet for the first.
Bagged trees share a defect: they are all built from the same features by the same greedy criterion. If one feature is strongly predictive, every tree splits on it at the root, and the trees come out similar — ρ stays high, and the ρσ² term puts a floor under the ensemble that more trees cannot break.
Random forests attack ρ directly: at each split, the tree may only consider a random subset of features (√d is the classification default). The dominant feature is unavailable for many splits, so different trees are forced to discover different structure, and their errors decorrelate. Each individual tree gets slightly worse (σ² up a little, bias up a little); the ensemble gets better, because in σ²(ρ + (1−ρ)/B) the drop in ρ outweighs both.
| Method | Diversity source | What falls |
|---|---|---|
| Single deep tree | — | nothing (high variance) |
| Bagging | bootstrap resampling | (1−ρ)σ²/B term |
| Random forest | resampling + per-split feature subsets | the ρσ² floor itself |