LLMs

Scaling Laws

How to spend a fixed compute budget, derived from a curve

01 · First principles: You have one budget and two knobs

Suppose you are given C floating-point operations to train a language model, once. Training cost is roughly C ≈ 6ND for a model with N parameters trained on D tokens. So C buys you a position on the curve ND ≈ C/6: a huge model skimming little data, or a small model grinding through a lot of it. Which point do you pick?
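To make the trade-off concrete, here is a minimal sketch that walks along the curve C ≈ 6ND for one budget. The budget value and the function name `tokens_for_budget` are assumptions chosen for illustration, not figures from any real training run.

```python
def tokens_for_budget(compute_flops: float, n_params: float) -> float:
    """Tokens D affordable for a model with N parameters, using C ~ 6 * N * D."""
    return compute_flops / (6.0 * n_params)


# Hypothetical budget in FLOPs, chosen only to illustrate the trade-off.
budget = 1e21

for n_params in (1e8, 1e9, 1e10):
    d_tokens = tokens_for_budget(budget, n_params)
    print(
        f"N = {n_params:.0e} params -> D = {d_tokens:.2e} tokens "
        f"({d_tokens / n_params:.1f} tokens per parameter)"
    )
```

Every row costs the same compute. Each tenfold increase in N gives up a factor of ten in D, so the tokens-per-parameter ratio falls a hundredfold.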

Before scaling laws, this was answered by taste. The discovery is that you do not need taste, because pretraining loss turns out to be a startlingly smooth, predictable function of N, D and C across many orders of magnitude. Fit the function on small cheap runs, then read off where to spend the large expensive one.

02 · The empirical law: Loss falls as a power law

Keep the other resources plentiful and vary one. The cross-entropy loss then follows a power law in that resource, down to an irreducible floor (the entropy of text itself):

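$$
L(N) \approx L_\infty + \left(\frac{N_c}{N}\right)^{\alpha_N}
\qquad
L(D) \approx L_\infty + \left(\frac{D_c}{D}\right)^{\alpha_D}
$$

Here L_∞ is the entropy floor, (N_c / N)^α_N is the capacity term, and (D_c / D)^α_D is the data term.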

A power law is a straight line on log-log axes. That straightness is the entire practical content: it means a trend measured at 10^7 parameters keeps holding at 10^10, so small experiments genuinely forecast big ones. The exponents are small (α_N ≈ 0.076, α_D ≈ 0.095 in Kaplan's fits), which is why even a modest improvement in loss costs an order of magnitude in resources: ten times the parameters shrinks the capacity term by only about 16%.
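The "fit small, forecast big" step can be sketched as a straight-line fit in log-log space. The data below is synthetic: it is generated from a power law with the exponent quoted above, a hypothetical scale constant `N_C`, and a small hand-picked wiggle standing in for measurement noise. The sketch also assumes the floor has already been subtracted, so it fits only the reducible part of the loss.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares line through (log x, log y); returns (slope, intercept)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(slope, intercept, x):
    """Evaluate the fitted power law at x."""
    return math.exp(intercept + slope * math.log(x))


ALPHA_N = 0.076  # exponent quoted in the text
N_C = 1e14       # hypothetical scale constant, for illustration only

# Synthetic "small runs": reducible loss at five model sizes, with a wiggle.
sizes = [1e7, 3e7, 1e8, 3e8, 1e9]
wiggle = [1.01, 0.99, 1.00, 1.01, 0.99]
reducible = [w * (N_C / n) ** ALPHA_N for n, w in zip(sizes, wiggle)]

slope, intercept = fit_power_law(sizes, reducible)
forecast = predict(slope, intercept, 1e10)  # extrapolate 10x past the data
truth = (N_C / 1e10) ** ALPHA_N

print(f"fitted exponent: {-slope:.4f}")
print(f"forecast at 1e10 params: {forecast:.4f} (generating law: {truth:.4f})")
```

On real runs the floor is not known in advance and has to be fitted together with the exponent, which makes the extrapolation more delicate than this sketch suggests.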

[Figure: loss (log scale) against compute C (log scale), showing one grey curve per model size, the compute-efficient frontier, and the entropy-of-text floor.]

Each grey curve is one model size trained longer. Their lower envelope — the best loss reachable at each compute level — is a straight line on log-log axes.

03 · The correction: Kaplan vs Chinchilla

Kaplan et al. (2020) fit the joint law and concluded that, for a fixed budget, parameters matter much more than tokens: scale N aggressively, and data only gently. The field obeyed, and produced GPT-3-era models that were enormous and conspicuously undertrained.

Hoffmann et al. (2022, "Chinchilla") reran the experiment with one fix: Kaplan had used a fixed learning-rate schedule, which quietly penalised the longer training runs. With the schedule matched to each run length, the answer changed: N and D should scale equally, both as C^0.5, landing near a simple rule of thumb.

Compute-optimal ≈ 20 tokens per parameter. A 70B model wants roughly 1.4T tokens. Chinchilla (70B parameters, 1.4T tokens) beat Gopher (280B parameters, 300B tokens) on roughly the same compute, with a quarter of the parameters and more than four times the data.
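The rule of thumb turns into a two-line calculation: substitute D = 20N into C ≈ 6ND and solve for N. The sketch below does that and checks it against the Chinchilla and Gopher figures quoted above. The function names are illustrative, and the 6ND cost model is the same approximation used earlier.

```python
import math

TOKENS_PER_PARAM = 20.0  # rule of thumb quoted in the text


def flops(n_params: float, n_tokens: float) -> float:
    """Approximate training cost C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens


def compute_optimal(compute_flops: float, tokens_per_param: float = TOKENS_PER_PARAM):
    """Solve C = 6 * N * D with D = r * N, giving N = sqrt(C / (6 r))."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params


chinchilla_c = flops(70e9, 1.4e12)
gopher_c = flops(280e9, 300e9)
n_opt, d_opt = compute_optimal(chinchilla_c)

print(f"Chinchilla budget: {chinchilla_c:.2e} FLOPs")
print(f"Gopher budget:     {gopher_c:.2e} FLOPs")
print(f"Rule-of-thumb split: N = {n_opt:.2e} params, D = {d_opt:.2e} tokens")
```

Under this approximation the two budgets agree to within about 15%, and the rule recovers the 70B and 1.4T pairing exactly, since 1.4T is 20 times 70B.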
Quantity                      Kaplan 2020            Chinchilla 2022
Optimal N vs C                N ∝ C^0.73             N ∝ C^0.50
Optimal D vs C                D ∝ C^0.27             D ∝ C^0.50
Verdict on GPT-3-era models   about right            heavily undertrained
Learning-rate schedule        fixed (hidden flaw)    matched to run length
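To see what the two sets of exponents in the comparison imply in practice, the sketch below asks how much N and D should each grow when the budget grows a hundredfold. The exponents are the ones in the comparison. The 100x multiplier and the `growth` helper are arbitrary choices for illustration.

```python
# Scaling exponents from the Kaplan vs Chinchilla comparison.
EXPONENTS = {
    "Kaplan 2020": {"N": 0.73, "D": 0.27},
    "Chinchilla 2022": {"N": 0.50, "D": 0.50},
}


def growth(compute_multiplier: float, exponent: float) -> float:
    """Factor by which a quantity scaling as C**exponent grows."""
    return compute_multiplier ** exponent


for name, exps in EXPONENTS.items():
    n_x = growth(100.0, exps["N"])
    d_x = growth(100.0, exps["D"])
    print(f"{name}: 100x compute -> {n_x:.1f}x params, {d_x:.1f}x tokens")
```

In both rows the two factors multiply back to 100, because the exponents sum to one. Kaplan's prescription puts roughly 29x into parameters and about 3.5x into data, where Chinchilla's puts 10x into each.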

04 · The point: Budget allocation, not prophecy

It is tempting to read scaling laws as a forecast of intelligence. That is not what they are for. The law predicts pretraining loss, and the mapping from loss to downstream capability is lumpy (specific abilities can appear abruptly even while loss falls smoothly). What the law actually answers is an engineering question: given C, what (N, D) pair minimises loss? It converts a research gamble into an allocation problem, the same way a portfolio rule converts stock-picking into rebalancing.
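The allocation question can be worked through in closed form under one simplifying assumption: that the capacity and data terms simply add, with L_∞ as the entropy floor. This additive form is an illustrative assumption, not the fitted joint law of either paper, but it shows why the answer depends only on the ratio of the two exponents.

```latex
\[
\begin{aligned}
&\min_{N,\,D}\; L(N, D) = L_\infty
  + \left(\frac{N_c}{N}\right)^{\alpha_N}
  + \left(\frac{D_c}{D}\right)^{\alpha_D}
  \quad \text{subject to} \quad 6ND = C \\[4pt]
&\text{Substituting } D = \frac{C}{6N} \text{ and setting } \frac{dL}{dN} = 0: \quad
  \alpha_N \left(\frac{N_c}{N}\right)^{\alpha_N}
  = \alpha_D \left(\frac{D_c}{D}\right)^{\alpha_D} \\[4pt]
&\Rightarrow\; N_{\mathrm{opt}} \propto C^{\,\alpha_D / (\alpha_N + \alpha_D)},
  \qquad
  D_{\mathrm{opt}} \propto C^{\,\alpha_N / (\alpha_N + \alpha_D)}
\end{aligned}
\]
```

When the two exponents are equal, both powers are 0.5, which is the even split of the Chinchilla result. The single-variable exponents quoted earlier are not meant to be plugged in here, since the joint laws fitted in the papers have their own forms and constants.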

This is also why the result matters commercially. Getting the split wrong wastes the budget by a constant factor — and at frontier scale, a constant factor is the price of a second training run.

05 · Caveats: Where the clean story bends
