General ML

Decision Trees

Twenty questions, played greedily against the data

01 · First principlesWhat structure does a tree exploit?

A tree bets that the input space can be carved into rectangles, each with a roughly constant label — that the data's structure is axis-aligned and threshold-like. "Income above 50k and age below 30" is a rectangle. This is a natural fit for tabular data, where individual features genuinely carry threshold meaning (a credit score of 700 means something by itself), and a poor fit for pixels, where no single coordinate means anything.

The model is a sequence of yes/no questions of the form xj ≤ t. Each question splits the current region in two; leaves predict the majority class (or mean) of the training points that land there. Prediction is a walk from root to leaf — a game of twenty questions where the questions were chosen during training.

02 · The algorithmGreedy splits by impurity drop

Finding the globally optimal tree is NP-hard, so we grow greedily: at each node, try every feature j and threshold t, and keep the split that most purifies the children. Purity is measured per node over class proportions pc:

Gini:  G = 1 − Σc pc²   (probability two random draws disagree)
Entropy:  H = −Σc pc log₂ pc   (bits of label uncertainty left)

A split is scored by the impurity drop ΔI = I(parent) − weighted mean of I(children); choosing by entropy makes ΔI the information gain of the question. Gini and entropy almost always pick the same splits (both are concave, peaked at uniform p), so the choice between them is a non-issue in practice. Recurse until leaves are pure or too small.

Greedy means myopic: the tree never looks two questions ahead, so an XOR-like pattern — where no single split helps but a pair would — can stall it. Mostly tolerable; worth knowing.

03 · The failureFully grown trees memorise

Let the recursion run to the end and every leaf becomes pure — often containing a single training point. Training error hits zero, and the tree has become a lookup table of the training set, noise included. A handful of changed points near the root cascades into a completely different tree below. This is textbook high variance: the model family is expressive enough (low bias), but a particular draw of the data swings the fit wildly.

Two escapes exist, and they define two branches of the field:

Prune · accept some bias
Limit depth, require a minimum leaf size, or grow fully and prune back by cost-complexity (penalise leaves at rate α, chosen by cross-validation). One small, readable tree; you pay in accuracy.
Ensemble · average the variance away
Keep the trees deep and unstable, but grow many and combine them: bagging / random forests average independent wobbles down; boosting instead stacks weak shallow trees to grind bias down. The instability of a single tree is precisely what makes it the ideal ensemble ingredient (see ensembles).

04 · What trees uniquely offerThe practical virtues

05 · The geometric weaknessDiagonals become staircases

Every split is perpendicular to an axis, so every decision boundary a tree can draw is a union of axis-aligned rectangle edges. When the true boundary is a diagonal line — trivial for a linear model — the tree must approximate it with a staircase, and each stair step is another split consuming another slice of data. Rotating your features by 45° can turn an easy problem into a hard one, which is a strange property for a learner to have.

TRUE BOUNDARY (DASHED) VS WHAT A TREE CAN DRAW each stair = one more split = more data spent

Axis-aligned splits can only build staircases. The finer the staircase, the deeper the tree and the higher the variance.

Mental Model