General ML

Decision Trees

Twenty questions, played greedily against the data

01 · First principlesWhat structure does a tree exploit?

A tree bets that the input space can be carved into rectangles, each with a roughly constant label — that the data's structure is axis-aligned and threshold-like. "Income above 50k and age below 30" is a rectangle. This is a natural fit for tabular data, where individual features genuinely carry threshold meaning (a credit score of 700 means something by itself), and a poor fit for pixels, where no single coordinate means anything.

The model is a sequence of yes/no questions of the form x_j ≤ t. Each question splits the current region in two; leaves predict the majority class (or mean) of the training points that land there. Prediction is a walk from root to leaf — a game of twenty questions where the questions were chosen during training.

02 · The algorithmGreedy splits by impurity drop

Finding the globally optimal tree is NP-hard, so we grow greedily: at each node, try every feature j and threshold t, and keep the split that most purifies the children. Purity is measured per node over class proportions p_c:

Gini: G = 1 − Σ_c p_c² (probability two random draws disagree)

Entropy: H = −Σ_c p_c log₂ p_c (bits of label uncertainty left)

A split is scored by the impurity drop ΔI = I(parent) − weighted mean of I(children); choosing by entropy makes ΔI the information gain of the question. Gini and entropy almost always pick the same splits (both are concave, peaked at uniform p), so the choice between them is a non-issue in practice. Recurse until leaves are pure or too small.

Greedy means myopic: the tree never looks two questions ahead, so an XOR-like pattern — where no single split helps but a pair would — can stall it. Mostly tolerable; worth knowing.

03 · The failureFully grown trees memorise

Let the recursion run to the end and every leaf becomes pure — often containing a single training point. Training error hits zero, and the tree has become a lookup table of the training set, noise included. A handful of changed points near the root cascades into a completely different tree below. This is textbook high variance: the model family is expressive enough (low bias), but a particular draw of the data swings the fit wildly.

Two escapes exist, and they define two branches of the field:

Prune · accept some bias

Limit depth, require a minimum leaf size, or grow fully and prune back by cost-complexity (penalise leaves at rate α, chosen by cross-validation). One small, readable tree; you pay in accuracy.

Ensemble · average the variance away

Keep the trees deep and unstable, but grow many and combine them: bagging / random forests average independent wobbles down; boosting instead stacks weak shallow trees to grind bias down. The instability of a single tree is precisely what makes it the ideal ensemble ingredient (see ensembles).

04 · What trees uniquely offerThe practical virtues

No scaling needed. Splits compare a feature only against its own threshold; monotone transformations of a feature change nothing. The preprocessing that is life-and-death for k-NN and SVMs is simply irrelevant.
Mixed types, missing values. Numeric, ordinal, and categorical features coexist without encoding gymnastics; surrogate splits handle missing entries.
Built-in feature selection. Irrelevant features are never split on, so they cost nothing — the exact opposite of distance-based methods, which they poison.
Interpretability of a single tree. The model is an explanation: a root-to-leaf path is a human-readable rule. (This virtue dies the moment you ensemble; a forest of 500 trees explains nothing directly.)

05 · The geometric weaknessDiagonals become staircases

Every split is perpendicular to an axis, so every decision boundary a tree can draw is a union of axis-aligned rectangle edges. When the true boundary is a diagonal line — trivial for a linear model — the tree must approximate it with a staircase, and each stair step is another split consuming another slice of data. Rotating your features by 45° can turn an easy problem into a hard one, which is a strange property for a learner to have.

Axis-aligned splits can only build staircases. The finer the staircase, the deeper the tree and the higher the variance.

Mental Model

A tree is twenty questions: greedy axis-aligned splits, each chosen to maximise impurity drop (Gini or entropy — same splits in practice).
Grown fully, it is a lookup table of the training set: low bias, ruinous variance.
Two escapes: prune one tree (interpretable, weaker) or ensemble many (accurate, opaque) — bagging fights the variance, boosting fights the bias.
Unique virtues: scale-free, mixed types, ignores irrelevant features, and a single tree reads as rules.
Geometric tax: diagonal truths become staircases; the tree pays data for every stair.