Why no learner wins everywhere, and why learning works regardless
The first-principles question: is there a universally best learning algorithm — one that beats the others no matter what function generates the data? Wolpert's answer is a flat no, and stronger than intuition expects:
Note the precise scope: performance on points outside the training set, averaged over every function from inputs to labels, each weighted equally. Both italicised phrases carry the whole theorem.
The proof idea fits in a paragraph. Fix any training set you like. For every target function that agrees with your training data and continues one way off it, there is another function agreeing with the same training data and continuing every other way — and uniform averaging weights them identically. Whatever your algorithm predicts at an unseen point, exactly half the consistent worlds say it is right and half say it is wrong (in the binary case). The training data constrains nothing beyond itself.
In a world drawn uniformly at random, the past is literally uninformative about the future at new inputs. Learning is impossible there — not hard, impossible — and all algorithms are equal the way all lottery strategies are equal.
Yet learning plainly works — models trained on Tuesday predict Wednesday. The theorem and the practice are reconciled by its premise: nature does not draw target functions uniformly. The functions we actually meet are an astronomically thin, heavily structured slice:
Every algorithm's inductive bias — its architecture, its regulariser, its preference among hypotheses that fit the data equally — is a bet that the world has one of these structures. NFL says the bet cannot be free: any bias that helps on structured worlds must hurt on the (unstructured) rest. Learning works because we keep playing in the casino where our bets happen to match the house.
"No universal best model" is not a counsel of despair; it is a design instruction. The practical craft of ML is choosing the algorithm whose bet matches your data's structure:
| Data structure | Matching inductive bias | The bet it encodes |
|---|---|---|
| Images | CNNs | Locality + translation invariance: nearby pixels relate; position does not change identity. |
| Sequences / language | Transformers, RNNs | Order matters; long-range dependencies exist and are sparse enough to attend to. |
| Graphs / molecules | GNNs | Information flows along edges; node identity is permutation-invariant. |
| Tabular, heterogeneous columns | Gradient-boosted trees | Axis-aligned splits and feature interactions; no smoothness across unrelated columns. |
| Small data, strong theory | Bayesian models with informative priors | The structure is known well enough to write down — see MLE vs MAP. |
This is also why transformers eating every domain does not refute NFL: attention plus massive data is a broad bias (weak assumptions, high capacity) purchased with enormous sample cost — exactly the trade the bias–variance note prices.
A benchmark is a sample from one structured corner of function space. "SOTA on the benchmark" therefore means "this bias matches this corner", and licenses no claim about other corners — a fact rediscovered every time a leaderboard champion is deployed against slightly shifted data. NFL is the formal reason that: