Boosting

Weak learners in sequence, each fixing what the last got wrong

01 · First principles: The opposite bet to bagging

Bagging starts with strong, unstable learners and averages their wobble away — a variance cure. Boosting makes the opposite bet: start with learners that are weak on purpose — decision stumps or depth-3 trees, each barely better than chance, all bias, almost no variance — and ask whether many of them can be composed into a strong learner. (That this is possible at all was a theoretical surprise: Schapire's 1990 answer to a question of Kearns.)

The composition cannot be a plain average; averaging identical weaklings gives a calm weakling. The members must be specialised, and boosting specialises them by training sequentially: each new learner is fit not to the data, but to whatever the current ensemble still gets wrong. Diversity is manufactured by construction, not by resampling (the contrast that organises the whole subject of ensembles).

02 · The unifying view: Gradient descent in function space

The clean way to see all of boosting at once: we are building an additive model $F_M(x) = \sum_{m=1}^{M} \eta\, h_m(x)$ to minimise a loss $\sum_i L(y_i, F(x_i))$. Treat the vector of current predictions $(F(x_1), \dots, F(x_n))$ as the parameters and take a gradient step, except that the "step" must be a function we can evaluate on new points, so we fit a weak learner to the negative gradient:

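$$r_i = -\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \quad\longrightarrow\quad h_m \approx \text{fit to } \{(x_i, r_i)\} \quad\longrightarrow\quad F_m = F_{m-1} + \eta\, h_m$$

Here $r_i$ are the pseudo-residuals (the gradient is evaluated at the current model $F = F_{m-1}$) and $\eta$ is the learning rate.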

This is functional gradient descent: each tree is one gradient step, approximated within the span of weak learners. Special cases drop out immediately. With squared loss $L = \tfrac{1}{2}(y_i - F(x_i))^2$, $r_i = y_i - F(x_i)$: each tree literally fits the residuals. With log-loss you get modern gradient-boosted classification. With exponential loss you recover AdaBoost, whose famous example reweighting is just this gradient in disguise. One mechanism, many costumes.
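The update rule above can be sketched in a few dozen lines of plain Python. This is a minimal illustration, not a reference implementation: the weak learner is a least-squares stump on one-dimensional inputs, the model starts from F_0 = 0, and every name here (`fit_stump`, `boost`, `predict`, `neg_grad_squared`, `neg_grad_logloss`) is made up for the example. Swapping the negative-gradient function is the only change needed to move from regression to classification.

```python
import math

def fit_stump(xs, targets):
    """Least-squares stump on 1-D inputs: returns (threshold, left_value, right_value)."""
    best = None
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        t = (lo + hi) / 2.0  # candidate threshold between two distinct x values
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def stump_predict(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv

def neg_grad_squared(y, f):
    # L = 0.5 * (y - f)^2, so -dL/df = y - f (the ordinary residual)
    return y - f

def neg_grad_logloss(y, f):
    # y in {0, 1}, f is a log-odds score; -dL/df = y - sigmoid(f)
    return y - 1.0 / (1.0 + math.exp(-f))

def boost(xs, ys, neg_grad, n_rounds=50, eta=0.1):
    """Functional gradient descent: each round fits a stump to the pseudo-residuals."""
    stumps = []
    preds = [0.0] * len(xs)  # assumption for this sketch: F_0 = 0
    for _ in range(n_rounds):
        residuals = [neg_grad(y, f) for y, f in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [f + eta * stump_predict(stump, x) for f, x in zip(preds, xs)]
    return stumps

def predict(stumps, x, eta=0.1):
    # eta must match the value used in boost()
    return sum(eta * stump_predict(s, x) for s in stumps)

# Toy regression data: a sine wave sampled on a grid.
xs = [i / 10 for i in range(40)]
ys = [math.sin(x) for x in xs]

def train_mse(stumps):
    return sum((y - predict(stumps, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

print(train_mse(boost(xs, ys, neg_grad_squared, n_rounds=1)))
print(train_mse(boost(xs, ys, neg_grad_squared, n_rounds=200)))
```

With squared loss and a learning rate between 0 and 2, each round can only lower the training error, which is why the second printed value is smaller than the first. Passing `neg_grad_logloss` with 0/1 labels turns the same loop into a classifier whose output is a log-odds score.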

[Figure: F₁ → F₂ → F₃ → … → F_M, climbing toward the signal, one tree per step. F₁ is a single stump.]

Each weak tree adds a coarse correction where the current fit is most wrong; the sum sharpens toward the dashed truth.

03 · The knobs: Shrinkage, depth, and stopping

04 · Contrast: Boosting vs bagging, side by side

Bagging / random forest
Parallel. Strong, deep, low-bias members; independence via bootstrap + feature subsets; averaging kills variance; adding more trees does not make overfitting worse; embarrassingly parallel to train; default settings that almost always work.
Boosting
Sequential. Weak, shallow, high-bias members; each fits the predecessors' pseudo-residuals; addition kills bias; more rounds eventually overfit (early stopping required); inherently sequential; higher ceiling, more knobs.
Diagnosis decides: if single deep trees already overfit your data, bag them. If even your best single model underfits, boost. The two are answers to opposite halves of the bias–variance decomposition.
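Because more boosting rounds eventually overfit, the number of rounds is usually chosen by watching a held-out loss. The sketch below shows one common patience-based rule, as an illustration rather than any particular library's behaviour; the function name `early_stopping_round` and the loss values are hypothetical.

```python
def early_stopping_round(val_losses, patience=3):
    """Return the 1-based round with the best validation loss, scanning
    until `patience` consecutive rounds fail to improve on it."""
    best_round, best_loss, since_best = 0, float("inf"), 0
    for m, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_round, best_loss, since_best = m, loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # stop training; later rounds are never evaluated
    return best_round

# Made-up validation losses, one per boosting round, for illustration only.
curve = [0.90, 0.60, 0.45, 0.40, 0.41, 0.39, 0.42, 0.43, 0.44, 0.30]
print(early_stopping_round(curve, patience=3))  # prints 6
```

The scan stops at round 9, after three rounds without improvement over round 6, so the final value of the made-up curve is never seen. A larger `patience` trades extra training time for a smaller chance of stopping on a temporary plateau.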

05 · Engineering apex: XGBoost and LightGBM

The dominance of boosting on tabular benchmarks is half theory, half engineering. XGBoost added second-order (Newton) steps — using the Hessian of the loss as well as the gradient — explicit L1/L2 regularisation on leaf weights, principled handling of missing values, and column subsampling borrowed straight from random forests (the methods converge in practice). LightGBM made it fast at scale: histogram-binned split finding and leaf-wise growth that spends depth only where the loss says it pays.
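The second-order step can be made concrete with the leaf-weight and split-gain formulas from the XGBoost objective. The sketch below covers only the L2 penalty (`lam`) and the per-split complexity cost (`gamma`); the L1 term is left out for brevity, and the function names are invented for this example rather than taken from the library. Note the sign convention: `grads` holds the gradients themselves, not the negative gradients used for pseudo-residuals.

```python
def leaf_weight(grads, hess, lam=1.0):
    """Newton step for one leaf: w* = -G / (H + lam),
    where G and H are the sums of gradients and Hessians in the leaf."""
    G, H = sum(grads), sum(hess)
    return -G / (H + lam)

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Loss reduction from splitting a node into left and right children."""
    def score(g, h):
        return g * g / (h + lam)
    GL, HL = sum(g_left), sum(h_left)
    GR, HR = sum(g_right), sum(h_right)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# Squared loss 0.5 * (y - F)^2 gives gradient F - y and Hessian 1.
# Residuals y - F of [1, 2, 3] therefore mean gradients of [-1, -2, -3].
grads, hess = [-1.0, -2.0, -3.0], [1.0, 1.0, 1.0]
print(leaf_weight(grads, hess, lam=0.0))  # 2.0, the mean residual
print(leaf_weight(grads, hess, lam=1.0))  # 1.5, shrunk toward zero by the penalty
```

With squared loss and no penalty the Newton step reduces to the mean residual in the leaf, which ties the second-order view back to plain residual fitting; the L2 term simply shrinks each leaf weight toward zero.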

The honest summary of a decade of Kaggle and industry tabular ML: tuned gradient-boosted trees are the strongest general-purpose tabular model, with random forests as the no-tuning fallback and deep learning rarely worth the trouble below millions of rows.

Mental Model