Weak learners in sequence, each fixing what the last got wrong
01 · First principles · The opposite bet to bagging
Bagging starts with strong, unstable learners and averages their wobble away — a variance cure. Boosting makes the opposite bet: start with learners that are weak on purpose — decision stumps or depth-3 trees, each barely better than chance, all bias, almost no variance — and ask whether many of them can be composed into a strong learner. (That this is possible at all was a theoretical surprise: Schapire's 1990 answer to a question of Kearns.)
The composition cannot be a plain average; averaging identical weaklings gives a calm weakling. The members must be specialised, and boosting specialises them by training sequentially: each new learner is fit not to the raw data, but to whatever the current ensemble still gets wrong. Diversity is manufactured by construction, not by resampling (the contrast that organises all of ensemble learning).
02 · The unifying view · Gradient descent in function space
The clean way to see all boosting at once. We are building an additive model $F_M(x) = \sum_{m=1}^{M} \eta\, h_m(x)$ to minimise a loss $\sum_i L(y_i, F(x_i))$. Treat the vector of current predictions $(F(x_1), \dots, F(x_n))$ as the parameters and take a gradient step, except the "step" must be a function we can evaluate on new points, so we fit a weak learner to the negative gradient:
$$r_i = -\left.\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right|_{F = F_{m-1}} \;\to\; h_m \approx \text{fit to } \{(x_i, r_i)\} \;\to\; F_m = F_{m-1} + \eta\, h_m$$
Here the $r_i$ are the pseudo-residuals and $\eta$ is the learning rate.
This is functional gradient descent: each tree is one gradient step, approximated within the span of weak learners. Special cases drop out immediately. With squared loss $L = \tfrac{1}{2}(y - F)^2$, $r_i = y_i - F(x_i)$: each tree literally fits the residuals. With log-loss you get modern gradient-boosted classification. With exponential loss you recover AdaBoost, whose famous example reweighting is just this gradient in disguise. One mechanism, many costumes.
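To make the three special cases concrete, the pseudo-residuals can be written out loss by loss. The conventions here are assumptions chosen for the illustration: the factor of 1/2 on squared loss, labels in {0, 1} with F read as log-odds for log-loss, and labels in {-1, +1} for exponential loss.

```latex
\begin{aligned}
\text{squared loss:}\quad
  & L = \tfrac{1}{2}\bigl(y_i - F(x_i)\bigr)^2
  && \Rightarrow\; r_i = y_i - F(x_i) \\
\text{log-loss } \bigl(y_i \in \{0,1\},\ p_i = \sigma(F(x_i))\bigr):\quad
  & L = -y_i \log p_i - (1 - y_i)\log(1 - p_i)
  && \Rightarrow\; r_i = y_i - p_i \\
\text{exponential loss } \bigl(y_i \in \{-1,+1\}\bigr):\quad
  & L = e^{-y_i F(x_i)}
  && \Rightarrow\; r_i = y_i\, e^{-y_i F(x_i)} = y_i\, w_i
\end{aligned}
```

In the last line, $w_i = e^{-y_i F(x_i)}$ is, up to normalisation, the weight AdaBoost places on example $i$: it is large exactly where the current ensemble is confidently wrong, which is why the reweighting and the gradient are the same object.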
Figure: each weak tree adds a coarse correction where the current fit is most wrong; the sum sharpens toward the true function (dashed line).
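The update rule in this section can be sketched for the squared-loss case in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names `fit_stump` and `boost`, the 1-D synthetic data, the constant initial model (the mean of y), and the settings η = 0.1 and 100 rounds are all choices made here for the example.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares depth-1 tree on a 1-D feature; returns a predict function.
    Assumes x contains at least two distinct values."""
    xs = np.unique(x)
    best = None
    for t in (xs[:-1] + xs[1:]) / 2.0:          # candidate thresholds: midpoints
        left = x <= t
        lv, rv = r[left].mean(), r[~left].mean()  # optimal leaf values
        sse = ((r[left] - lv) ** 2).sum() + ((r[~left] - rv) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    _, t, lv, rv = best
    return lambda z: np.where(z <= t, lv, rv)

def boost(x, y, n_rounds=100, eta=0.1):
    """Gradient boosting with squared loss: each stump fits the residuals."""
    f0 = float(y.mean())                  # assumed constant starting model
    F = np.full(len(y), f0)
    stumps, losses = [], []
    for _ in range(n_rounds):
        r = y - F                         # negative gradient of 1/2 (y - F)^2
        h = fit_stump(x, r)               # weak learner approximates the gradient
        F = F + eta * h(x)                # shrunken step in function space
        stumps.append(h)
        losses.append(float(np.mean((y - F) ** 2)))
    predict = lambda z: f0 + eta * sum(h(z) for h in stumps)
    return predict, losses

# Hypothetical toy problem: a noisy sine curve.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 6.0, 150)
y = np.sin(x) + rng.normal(0.0, 0.2, 150)
predict, losses = boost(x, y)
print(f"train MSE after 1 round: {losses[0]:.4f}, after 100: {losses[-1]:.4f}")
```

With squared loss and least-squares leaf values, the training loss cannot increase from one round to the next for any η in (0, 1], which is why `losses` traces a monotone curve. Swapping the line that computes `r` for another loss's negative gradient is the only change needed to boost a different objective in this sketch.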
03 · The knobs · Shrinkage, depth, and stopping
Learning rate η (shrinkage): scale each tree's contribution down (0.01–0.1 typical). Small steps mean each tree corrects gently and later trees re-examine the same errors, which regularises strongly. The robust empirical law: lower η with more trees generalises at least as well as higher η with fewer trees, and the price is paid in compute.
Why depth stays small: a depth-k tree can express interactions among at most k features; most real signal lives in low-order interactions, and shallow trees are exactly the high-bias weak learners the theory wants. Depth 3–8 is the working range. Deep trees here would chase residual noise: the residuals shrink as boosting proceeds, until what remains is mostly noise.
Early stopping: that last clause is the failure mode. Boosting reduces bias relentlessly, and given enough rounds it will fit the noise too — unlike bagging, where more members never hurt. Monitor a validation set and stop when its loss turns; this is the one non-negotiable safeguard.
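The stopping rule described above can be sketched with a held-out set and a patience counter. As before, this is an illustration under assumptions rather than any library's implementation: `fit_stump`, `boost_with_early_stopping`, the `patience` parameter, the 100/50 split, and the synthetic data are all hypothetical names and choices introduced here.

```python
import numpy as np

def fit_stump(x, r):
    """Least-squares depth-1 tree on a 1-D feature; returns a predict function."""
    xs = np.unique(x)
    best = None
    for t in (xs[:-1] + xs[1:]) / 2.0:
        left = x <= t
        lv, rv = r[left].mean(), r[~left].mean()
        sse = ((r[left] - lv) ** 2).sum() + ((r[~left] - rv) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    _, t, lv, rv = best
    return lambda z: np.where(z <= t, lv, rv)

def boost_with_early_stopping(x_tr, y_tr, x_va, y_va,
                              eta=0.1, max_rounds=300, patience=20):
    """Squared-loss boosting that stops once validation loss has not
    improved for `patience` consecutive rounds."""
    f0 = float(y_tr.mean())
    F_tr = np.full(len(y_tr), f0)
    F_va = np.full(len(y_va), f0)
    best_round = 0                                   # 0 means "keep only f0"
    best_loss = float(np.mean((y_va - F_va) ** 2))
    val_losses = []
    for m in range(1, max_rounds + 1):
        h = fit_stump(x_tr, y_tr - F_tr)             # fit residuals on train only
        F_tr = F_tr + eta * h(x_tr)
        F_va = F_va + eta * h(x_va)
        loss = float(np.mean((y_va - F_va) ** 2))
        val_losses.append(loss)
        if loss < best_loss:
            best_round, best_loss = m, loss
        elif m - best_round >= patience:
            break                                    # validation loss has turned
    return best_round, best_loss, val_losses

# Hypothetical toy problem: noisy sine, first 100 points train, last 50 validate.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 6.0, 150)
y = np.sin(x) + rng.normal(0.0, 0.3, 150)
best_round, best_loss, val_losses = boost_with_early_stopping(
    x[:100], y[:100], x[100:], y[100:])
print(f"best round: {best_round}, validation MSE there: {best_loss:.4f}")
```

The model to keep is the ensemble truncated at `best_round`, not the one from the final round trained. A smaller η typically pushes `best_round` later, which is the compute cost the shrinkage rule refers to.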
04 · Contrast · Boosting vs bagging, side by side
Bagging / random forest
Parallel. Strong, deep, low-bias members; independence via bootstrap + feature subsets; averaging kills variance; adding trees does not cause overfitting; embarrassingly parallel to train; one plain default configuration that almost always works.
Boosting
Sequential. Weak, shallow, high-bias members; each fits the predecessors' pseudo-residuals; addition kills bias; more rounds eventually overfit (early stopping required); inherently sequential; higher ceiling, more knobs.
Diagnosis decides: if single deep trees already overfit your data, bag them. If even your best single model underfits, boost. The two are answers to opposite halves of the bias–variance decomposition.
05 · Engineering apex · XGBoost and LightGBM
The dominance of boosting on tabular benchmarks is half theory, half engineering. XGBoost added second-order (Newton) steps that use the Hessian of the loss as well as the gradient, explicit L1/L2 regularisation on leaf weights, sparsity-aware split finding that learns a default direction for missing values, and column subsampling borrowed straight from random forests (the methods converge in practice). LightGBM made it fast at scale: histogram-binned split finding and leaf-wise growth that spends depth only where the loss says it pays.
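The Newton step with L2 regularisation has a closed form for each leaf: with G and H the sums of per-example gradients and Hessians falling in the leaf, the optimal weight is w* = -G / (H + λ). The sketch below evaluates that formula for log-loss on a toy leaf. The function names `logloss_grad_hess` and `newton_leaf_weight` are illustrative and are not XGBoost API; the L1 term and the split-gain computation are deliberately left out.

```python
import numpy as np

def logloss_grad_hess(y, F):
    """Per-example gradient and Hessian of log-loss w.r.t. the raw score F
    (labels y in {0, 1}, F interpreted as log-odds)."""
    p = 1.0 / (1.0 + np.exp(-F))
    return p - y, p * (1.0 - p)

def newton_leaf_weight(g, h, lam=1.0):
    """Optimal leaf value for a second-order loss approximation with an
    L2 penalty lam on the leaf weight: w* = -G / (H + lam)."""
    return -g.sum() / (h.sum() + lam)

# Toy leaf: four examples, three positives, current raw score 0 (p = 0.5).
y = np.array([1.0, 1.0, 1.0, 0.0])
F = np.zeros(4)
g, h = logloss_grad_hess(y, F)       # g = [-0.5, -0.5, -0.5, 0.5], h = 0.25 each
w = newton_leaf_weight(g, h, lam=1.0)
print(w)                             # G = -1, H = 1, lam = 1  ->  0.5
```

Setting `lam=0` recovers the plain Newton step -G / H; increasing it shrinks leaf weights toward zero, which is the regularising role the paragraph describes.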
The honest summary of a decade of Kaggle and industry tabular ML: tuned gradient-boosted trees are the strongest general-purpose tabular model, with random forests as the no-tuning fallback and deep learning rarely worth the trouble below millions of rows.
Mental Model
Boosting builds an additive model sequentially: each weak learner fits the negative gradient of the loss at the current predictions.
It is gradient descent in function space; AdaBoost, residual fitting, and log-loss boosting are one algorithm with different losses.
Bias falls round by round; variance creeps up — the mirror image of bagging — so early stopping is mandatory.
Keep trees shallow (low-order interactions) and η small (gentle steps); buy accuracy with more rounds.
XGBoost/LightGBM = this idea + Newton steps + regularisation + systems engineering; still the tabular champion.