How to spend a fixed compute budget, derived from a curve
Suppose you are given C floating-point operations to train a language model, once. Training cost is C ≈ 6ND for a model with N parameters trained on D tokens. So C buys you a position on a curve: a huge model skimming little data, or a small model grinding through a lot of it. Which point do you pick?
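To make the trade-off concrete, here is a minimal sketch that walks along that curve using the C ≈ 6ND approximation from the text. The budget of 1e21 FLOPs and the three model sizes are hypothetical values chosen only for illustration.

```python
# Sketch: enumerate (N, D) pairs that spend the same compute budget,
# using the approximation C ≈ 6 * N * D from the text.

FLOPS_PER_PARAM_TOKEN = 6  # the "6" in C ≈ 6ND


def tokens_for_budget(compute_flops: float, n_params: float) -> float:
    """Tokens D affordable for a model with n_params under the budget."""
    return compute_flops / (FLOPS_PER_PARAM_TOKEN * n_params)


budget = 1e21  # hypothetical budget in FLOPs, for illustration only
for n_params in (1e8, 1e9, 1e10):
    d_tokens = tokens_for_budget(budget, n_params)
    print(f"N = {n_params:.0e} params -> D = {d_tokens:.2e} tokens")
```

Every tenfold increase in N costs a tenfold cut in D, which is exactly the curve the budget confines you to.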
Before scaling laws, this was answered by taste. The discovery is that you do not need taste, because pretraining loss turns out to be a startlingly smooth, predictable function of N, D and C across many orders of magnitude. Fit the function on small cheap runs, then read off where to spend the large expensive one.
Hold everything else generous and vary one resource at a time. The cross-entropy loss follows a power law in each, down to an irreducible floor (the entropy of text itself).
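One way to write the single-resource laws is sketched below. The symbols are notation assumed here rather than taken from the text: L_∞ stands for the irreducible floor, N_c and D_c are fitted scale constants, and α_N, α_D are the power-law exponents.

```latex
L(N) \approx L_\infty + \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad
L(D) \approx L_\infty + \left(\frac{D_c}{D}\right)^{\alpha_D}
```

Subtracting the floor and taking logarithms gives log(L − L_∞) = α_N log N_c − α_N log N, which is linear in log N with slope −α_N.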
A power law is a straight line on log-log axes. That straightness is the entire practical content: it means a trend measured at 10^7 parameters keeps holding at 10^10, so small experiments genuinely forecast big ones. The exponents are small (α_N ≈ 0.076, α_D ≈ 0.095 in Kaplan's fits), which is why progress is so expensive: with α_N ≈ 0.076, a tenfold increase in parameters shrinks the reducible loss by only about 16%.
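The "fit small, forecast big" step can be sketched as a straight-line fit in log-log space. Everything below is an assumption made for illustration: the data are synthetic and noiseless, generated from a known power law with a hypothetical scale constant, and the loss values are taken to have the irreducible floor already subtracted.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares line through (log x, log y).

    Returns (alpha, intercept) such that log y ≈ intercept - alpha * log x.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    sxx = sum((u - mean_x) ** 2 for u in lx)
    slope = sxy / sxx
    return -slope, mean_y - slope * mean_x


# Synthetic "cheap runs": reducible loss = (n_c / N) ** alpha.
true_alpha = 0.076   # parameter exponent quoted in the text
true_n_c = 1e14      # hypothetical scale constant, not a fitted value
small_n = [1e6, 3e6, 1e7, 3e7, 1e8]
small_loss = [(true_n_c / n) ** true_alpha for n in small_n]

alpha, intercept = fit_power_law(small_n, small_loss)

# Extrapolate two orders of magnitude beyond the largest fitted model.
forecast = math.exp(intercept - alpha * math.log(1e10))
actual = (true_n_c / 1e10) ** true_alpha
print(f"fitted alpha = {alpha:.3f}")
print(f"forecast at 1e10 params = {forecast:.4f}, generating law = {actual:.4f}")
```

With real runs the points would scatter around the line, and the floor would have to be estimated too, so the forecast carries an error bar that this noiseless sketch hides.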
Figure: loss against compute. Each grey curve is one model size trained for longer; their lower envelope, the best loss reachable at each compute level, is a straight line on log-log axes.
Kaplan et al. (2020) fit the joint law and concluded that, for a fixed budget, parameters matter much more than tokens: scale N aggressively, and data only gently. The field obeyed, and produced GPT-3-era models that were enormous and conspicuously undertrained.
Hoffmann et al. (2022, "Chinchilla") reran the experiment with one fix. Kaplan had used a fixed learning-rate schedule, which quietly penalised the longer training runs. With the schedule matched to each run length, the answer changed: N and D should scale equally, both as C^0.5, landing near a simple rule of thumb of roughly 20 training tokens per parameter.
| | Kaplan 2020 | Chinchilla 2022 |
|---|---|---|
| Optimal N vs C | N ∝ C^0.73 | N ∝ C^0.50 |
| Optimal D vs C | D ∝ C^0.27 | D ∝ C^0.50 |
| Verdict on GPT-3-era models | about right | heavily undertrained |
| Learning-rate schedule | fixed across run lengths | matched to run length |
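The equal-scaling column can be turned into a small calculator. This is a sketch under two assumptions: the C ≈ 6ND cost model from earlier, and a fixed ratio of about 20 training tokens per parameter, which is the commonly quoted approximation of the Chinchilla result rather than an exact constant. The function name and the budgets are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20.0       # assumed rule-of-thumb ratio, approximate
FLOPS_PER_PARAM_TOKEN = 6.0   # from C ≈ 6ND


def equal_scaling_allocation(compute_flops: float) -> tuple[float, float]:
    """Solve C = 6 * N * D together with D = TOKENS_PER_PARAM * N."""
    n_params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM)
    )
    d_tokens = TOKENS_PER_PARAM * n_params
    return n_params, d_tokens


for budget in (1e21, 1e23):   # hypothetical budgets in FLOPs
    n, d = equal_scaling_allocation(budget)
    print(f"C = {budget:.0e}: N = {n:.2e} params, D = {d:.2e} tokens")
```

Because N and D both grow as C^0.5 under this rule, a hundredfold larger budget multiplies each of them by ten.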
It is tempting to read scaling laws as a forecast of intelligence. That is not what they are for. The law predicts pretraining loss, and the mapping from loss to downstream capability is lumpy (specific abilities can appear abruptly even while loss falls smoothly). What the law actually answers is an engineering question: given C, what (N, D) pair minimises loss? It converts a research gamble into an allocation problem, the same way a portfolio rule converts stock-picking into rebalancing.
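The allocation problem itself can be sketched as a one-dimensional search along the budget curve. The loss form L(N, D) = E + A/N^a + B/D^b and all five constants below are assumptions chosen for illustration, not fitted values from either paper; the function names are hypothetical as well.

```python
# Hypothetical constants for an assumed loss form
#   L(N, D) = E + A / N**a + B / D**b
# They are illustrative placeholders, not fitted values.
E, A, B = 1.7, 400.0, 400.0
a, b = 0.3, 0.3


def loss(n_params: float, d_tokens: float) -> float:
    """Assumed loss: irreducible floor plus one power-law term per resource."""
    return E + A / n_params**a + B / d_tokens**b


def best_allocation(compute_flops: float, steps: int = 2000) -> tuple[float, float]:
    """Grid-search N on a log scale; D is then fixed by C = 6 * N * D."""
    best = None
    for i in range(steps + 1):
        n_params = 10 ** (6 + 8 * i / steps)   # sweep 1e6 .. 1e14 parameters
        d_tokens = compute_flops / (6 * n_params)
        candidate = (loss(n_params, d_tokens), n_params, d_tokens)
        if best is None or candidate < best:
            best = candidate
    _, n_opt, d_opt = best
    return n_opt, d_opt


for budget in (6e20, 6e22):   # hypothetical budgets in FLOPs
    n_opt, d_opt = best_allocation(budget)
    print(f"C = {budget:.0e}: N = {n_opt:.2e}, D = {d_opt:.2e}")
```

Under this assumed form the optimum scales as N ∝ C^(b/(a+b)), so equal exponents reproduce the C^0.5 split, while unequal exponents tilt the budget toward whichever resource has the smaller one.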
This is also why the result matters commercially. Getting the split wrong wastes the budget by a constant factor — and at frontier scale, a constant factor is the price of a second training run.