General ML

No Free Lunch Theorem

Why no learner wins everywhere, and why learning works regardless

01 · The claimAveraged over all worlds, everyone ties

The first-principles question: is there a universally best learning algorithm — one that beats the others no matter what function generates the data? Wolpert's answer is a flat no, and stronger than intuition expects:

No Free Lunch: averaged uniformly over all possible target functions, every learning algorithm has exactly the same off-training-set generalisation performance. Gradient boosting, a deep network, and "predict by hashing the input" tie. So does predicting the opposite of your model.

Note the precise scope: performance on points outside the training set, averaged over every function from inputs to labels, each weighted equally. Both italicised phrases carry the whole theorem.

02 · Why it must be trueA random world has no pattern to exploit

The proof idea fits in a paragraph. Fix any training set you like. For every target function that agrees with your training data and continues one way off it, there is another function agreeing with the same training data and continuing every other way — and uniform averaging weights them identically. Whatever your algorithm predicts at an unseen point, exactly half the consistent worlds say it is right and half say it is wrong (in the binary case). The training data constrains nothing beyond itself.

E_f∼uniform[ generalisation(A, f) ] is the same for every algorithm A

In a world drawn uniformly at random, the past is literally uninformative about the future at new inputs. Learning is impossible there — not hard, impossible — and all algorithms are equal the way all lottery strategies are equal.

03 · The resolutionThe real world is not uniform over functions

Yet learning plainly works — models trained on Tuesday predict Wednesday. The theorem and the practice are reconciled by its premise: nature does not draw target functions uniformly. The functions we actually meet are an astronomically thin, heavily structured slice:

Smoothness: nearby inputs usually get nearby outputs — the silent license behind interpolation, kernels, and nearest neighbours.
Compositionality: complex things are built from reusable parts (edges → textures → objects; tokens → phrases → meaning) — what depth exploits.
Locality and invariance: a cat is a cat wherever it sits in the frame; physics does not care about your coordinate origin.

Every algorithm's inductive bias — its architecture, its regulariser, its preference among hypotheses that fit the data equally — is a bet that the world has one of these structures. NFL says the bet cannot be free: any bias that helps on structured worlds must hurt on the (unstructured) rest. Learning works because we keep playing in the casino where our bets happen to match the house.

One sentence to keep: generalisation is never paid for by data alone; it is paid for by data plus a correct assumption about the world, and NFL is the theorem that forbids skipping the second payment.

04 · ConsequencesMatch the bias to the structure

"No universal best model" is not a counsel of despair; it is a design instruction. The practical craft of ML is choosing the algorithm whose bet matches your data's structure:

Data structure	Matching inductive bias	The bet it encodes
Images	CNNs	Locality + translation invariance: nearby pixels relate; position does not change identity.
Sequences / language	Transformers, RNNs	Order matters; long-range dependencies exist and are sparse enough to attend to.
Graphs / molecules	GNNs	Information flows along edges; node identity is permutation-invariant.
Tabular, heterogeneous columns	Gradient-boosted trees	Axis-aligned splits and feature interactions; no smoothness across unrelated columns.
Small data, strong theory	Bayesian models with informative priors	The structure is known well enough to write down — see MLE vs MAP.

This is also why transformers eating every domain does not refute NFL: attention plus massive data is a broad bias (weak assumptions, high capacity) purchased with enormous sample cost — exactly the trade the bias–variance note prices.

05 · HumilityWhat the theorem does to leaderboards

A benchmark is a sample from one structured corner of function space. "SOTA on the benchmark" therefore means "this bias matches this corner", and licenses no claim about other corners — a fact rediscovered every time a leaderboard champion is deployed against slightly shifted data. NFL is the formal reason that:

benchmark gains do not transfer by right, only by shared structure;
"best model" is an incomplete sentence until you append "for data like this";
claims of a universal architecture should be read as claims about which structures are common, which is an empirical statement about the world, not a theorem about the method.

Interview-grade summary: NFL says all algorithms tie on average over all possible worlds; practice works because our world is a very particular one, and good ML is the art of encoding which particular one into the model.

Mental Model

Uniformly averaged over all target functions, every learner generalises identically — training data constrains nothing beyond itself in a random world.
Learning works because reality is not uniform: smoothness, compositionality, locality are the structure we cash in.
Inductive bias is a bet on that structure; NFL guarantees every bet loses somewhere.
Model selection = matching the bet to the data (CNN ↔ images, trees ↔ tabular).
Benchmarks measure one corner of function space; hold SOTA claims accordingly.