Scaling Laws

How to spend a fixed compute budget, derived from a curve

01 · First principles · You have one budget and two knobs

Suppose you are given C floating-point operations to train a language model, once. Training cost is roughly C ≈ 6ND: a model with N parameters seen over D tokens. So C buys you a position on a curve — a huge model skimming little data, or a small model grinding through a lot of it. Which point do you pick?
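To make the trade-off concrete, here is a minimal sketch of the iso-compute curve implied by C ≈ 6ND. The function name and the 1e21 FLOP budget are illustrative assumptions, not figures from any particular training run.

```python
def tokens_for_budget(compute_flops: float, n_params: float) -> float:
    """Tokens D affordable for a model with N parameters, from C ≈ 6 * N * D."""
    return compute_flops / (6.0 * n_params)


budget = 1e21  # hypothetical budget in FLOPs, chosen only for illustration

# Every row spends the same budget; only the split between N and D changes.
for n_params in (1e8, 1e9, 1e10):
    n_tokens = tokens_for_budget(budget, n_params)
    print(
        f"N = {n_params:.0e} params -> D = {n_tokens:.2e} tokens "
        f"({n_tokens / n_params:.1f} tokens per parameter)"
    )
```

Each tenfold increase in N costs a tenfold cut in D, so the tokens-per-parameter ratio moves by a factor of 100 between rows.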

Before scaling laws, this was answered by taste. The discovery is that you do not need taste, because pretraining loss turns out to be a startlingly smooth, predictable function of N, D and C across many orders of magnitude. Fit the function on small cheap runs, then read off where to spend the large expensive one.

02 · The empirical law · Loss falls as a power law

Hold everything else generous and vary one resource. The cross-entropy loss follows a power law in each, down to an irreducible floor (the entropy of text itself):

L(N) ≈ L_∞ + (N_c / N)^α_N
L(D) ≈ L_∞ + (D_c / D)^α_D

Here L_∞ is the entropy floor, (N_c / N)^α_N is the capacity term, and (D_c / D)^α_D is the data term.

A power law is a straight line on log-log axes. That straightness is the entire practical content: a trend measured at 10⁷ parameters keeps holding at 10¹⁰, so small experiments genuinely forecast big ones. The exponents are small (α_N ≈ 0.076, α_D ≈ 0.095 in Kaplan's fits), which is why progress is expensive: with α_N ≈ 0.076, a tenfold increase in parameters shrinks the capacity term by only about 16%, and halving it takes roughly four orders of magnitude more parameters.

Figure: loss against training compute. Each grey curve is one model size trained for longer. Their lower envelope, the best loss reachable at each compute level, is a straight line on log-log axes.
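The claim that small runs forecast big ones can be sketched as a straight-line fit in log-log space. This is an illustration under stated assumptions: the "measurements" are synthetic and noise-free, the constant TRUE_N_C is arbitrary, and the floor L_∞ is taken as already known and subtracted, so only the reducible loss is fitted. Real fits have to estimate the floor and cope with noise.

```python
import math


def fit_power_law(xs, reducible_losses):
    """Fit loss ≈ (x_c / x) ** alpha by least squares on (log x, log loss)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in reducible_losses]
    n = len(log_x)
    mean_x, mean_y = sum(log_x) / n, sum(log_y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    slope = sxy / sxx                  # slope of the log-log line is -alpha
    intercept = mean_y - slope * mean_x
    alpha = -slope
    x_c = math.exp(intercept / alpha)  # intercept = alpha * log(x_c)
    return alpha, x_c


# Synthetic "small cheap runs" (assumed values, not measurements).
TRUE_ALPHA, TRUE_N_C = 0.076, 8.8e13
small_sizes = [1e6, 3e6, 1e7, 3e7, 1e8]
small_losses = [(TRUE_N_C / n) ** TRUE_ALPHA for n in small_sizes]

alpha, n_c = fit_power_law(small_sizes, small_losses)

# Extrapolate two orders of magnitude past the largest fitted model.
forecast = (n_c / 1e10) ** alpha
print(f"fitted alpha = {alpha:.4f}, forecast reducible loss at 1e10 params = {forecast:.4f}")

# What one order of magnitude buys: the fractional drop in the capacity term.
print(f"10x parameters cuts the capacity term by {1 - 10 ** -alpha:.1%}")
```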

03 · The correction · Kaplan vs Chinchilla

Kaplan et al. (2020) fit the joint law and concluded that, for a fixed budget, parameters matter much more than tokens: scale N aggressively, and data only gently. The field obeyed, and produced GPT-3-era models that were enormous and conspicuously undertrained.

Hoffmann et al. (2022, "Chinchilla") reran the experiment with one fix — Kaplan had used a fixed learning-rate schedule, which quietly penalised the longer training runs. With the schedule matched to each run length, the answer changed: N and D should scale equally, both as C^0.5, landing near a simple rule of thumb.

Compute-optimal ≈ 20 tokens per parameter. A 70B model wants roughly 1.4T tokens. Chinchilla (70B parameters, 1.4T tokens) beat Gopher (280B parameters, 300B tokens) on roughly the same compute: a quarter of the parameters, more than four times the data.
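The rule of thumb combines with C ≈ 6ND to give a closed form: substituting D = 20N yields N = sqrt(C / 120). A minimal sketch, with a hypothetical function name, that also checks the Chinchilla and Gopher budgets quoted above:

```python
import math

TOKENS_PER_PARAM = 20.0  # rule of thumb from the text, not an exact constant


def compute_optimal(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Solve C = 6 * N * D with D = r * N, returning (N, D)."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params


# Budgets implied by the model sizes and token counts quoted in the text.
chinchilla_flops = 6 * 70e9 * 1.4e12
gopher_flops = 6 * 280e9 * 300e9
print(f"Chinchilla ≈ {chinchilla_flops:.2e} FLOPs, Gopher ≈ {gopher_flops:.2e} FLOPs")

n_opt, d_opt = compute_optimal(chinchilla_flops)
print(f"rule-of-thumb split: N ≈ {n_opt:.2e} params, D ≈ {d_opt:.2e} tokens")
```

Under the 6ND approximation the two budgets agree to within about 20%, which is the sense in which the comparison is "the same compute".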

Quantity	Kaplan 2020	Chinchilla 2022
Optimal N vs C	N ∝ C^0.73	N ∝ C^0.50
Optimal D vs C	D ∝ C^0.27	D ∝ C^0.50
Verdict on GPT-3-era models	about right	heavily undertrained
Learning-rate schedule	fixed (the hidden flaw)	matched to run length
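To see how far apart the two prescriptions drift, the sketch below applies the exponents from the table to a hypothetical 100× increase in compute. The dictionary and function names are illustrative.

```python
# (exponent for N, exponent for D), taken from the table above.
EXPONENTS = {
    "Kaplan 2020": (0.73, 0.27),
    "Chinchilla 2022": (0.50, 0.50),
}


def growth_factors(compute_multiplier, n_exp, d_exp):
    """How much N and D grow when the compute budget grows by compute_multiplier."""
    return compute_multiplier ** n_exp, compute_multiplier ** d_exp


for name, (n_exp, d_exp) in EXPONENTS.items():
    n_growth, d_growth = growth_factors(100.0, n_exp, d_exp)
    print(f"{name}: 100x compute -> N x{n_growth:.1f}, D x{d_growth:.1f}")
```

In both columns the exponents sum to 1, as C ∝ ND requires, so the two growth factors always multiply back to the compute multiplier. The disagreement is only about how the product is split.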

04 · The point · Budget allocation, not prophecy

It is tempting to read scaling laws as a forecast of intelligence. That is not what they are for. The law predicts pretraining loss, and the mapping from loss to downstream capability is lumpy (specific abilities can appear abruptly even while loss falls smoothly). What the law actually answers is an engineering question: given C, what (N, D) pair minimises loss? It converts a research gamble into an allocation problem, the same way a portfolio rule converts stock-picking into rebalancing.

This is also why the result matters commercially. Getting the split wrong wastes the budget by a constant factor — and at frontier scale, a constant factor is the price of a second training run.
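The allocation problem can be sketched as a one-dimensional search along the iso-compute curve. Everything numeric here is an assumption made for illustration: the additive loss form E + A/N^α + B/D^β is one common way to combine the two single-resource laws, and the constants are invented (not fitted to data) so that the optimum lands at 20 tokens per parameter.

```python
import math

# Hypothetical constants, chosen so the optimum sits at D / N = 20.
ALPHA = BETA = 0.3
E = 1.7                      # assumed irreducible floor
A = 400.0                    # assumed capacity-term scale
B = A * 20.0 ** ALPHA        # data-term scale that puts the optimum at 20 tokens/param


def loss(n_params, n_tokens):
    """Assumed joint law: floor plus a capacity term plus a data term."""
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA


def best_split(compute_flops, steps=4000):
    """Scan N on a log grid, with D fixed by C = 6 * N * D, and keep the lowest loss."""
    log_lo, log_hi = math.log(1e6), math.log(1e13)
    best = None
    for i in range(steps + 1):
        n_params = math.exp(log_lo + (log_hi - log_lo) * i / steps)
        n_tokens = compute_flops / (6.0 * n_params)
        candidate = (loss(n_params, n_tokens), n_params, n_tokens)
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best


budget = 6 * 70e9 * 1.4e12
best_loss, n_opt, d_opt = best_split(budget)
print(f"best split: N ≈ {n_opt:.2e}, D ≈ {d_opt:.2e}, tokens/param ≈ {d_opt / n_opt:.1f}")

# Same budget, lopsided split: four times the parameters, a quarter of the tokens.
lopsided_loss = loss(4 * n_opt, d_opt / 4)
print(f"loss at optimum = {best_loss:.4f}, loss at lopsided split = {lopsided_loss:.4f}")
```

A grid scan is used instead of a closed-form solution so that the same sketch keeps working if the assumed loss function is swapped for a different fitted form.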

05 · Caveats · Where the clean story bends

Data quality shifts the curve. The constants (not the exponents) depend on the corpus. Aggressive filtering and deduplication move the whole line down; a token of curated text is worth several tokens of raw crawl. See pretraining.
Data can run out. Chinchilla assumes unlimited fresh tokens. When the corpus is finite, repeating data gives diminishing (though nonzero, for a few epochs) returns, and the optimum slides back toward larger N.
Inference is not free. Chinchilla optimises training compute only. If a model will serve billions of requests, every parameter is paid for again at inference, forever. Inference-aware analysis says to overtrain: pick N well below compute-optimal and push D far past 20 tokens/param (Llama-style models train at 100–1000+ tokens/param). For decoupling parameters from inference cost more directly, see mixture of experts.
Loss is a proxy. Equal-loss models can differ on capabilities you care about; the law is silent on which.
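The inference caveat above can be made concrete with a rough lifetime-compute account. The sketch assumes the common approximation of about 2N FLOPs per generated or processed token at inference (a forward pass only), alongside the 6ND training estimate. The smaller model and the served-token count are hypothetical, and the sketch says nothing about whether the two models reach the same loss.

```python
def training_flops(n_params, train_tokens):
    """Training cost from the C ≈ 6 * N * D approximation."""
    return 6.0 * n_params * train_tokens


def inference_flops(n_params, served_tokens):
    """Assumed inference cost of about 2 * N FLOPs per token served."""
    return 2.0 * n_params * served_tokens


def lifetime_flops(n_params, train_tokens, served_tokens):
    """Training plus inference compute over the model's deployed life."""
    return training_flops(n_params, train_tokens) + inference_flops(n_params, served_tokens)


served = 1e13  # hypothetical number of tokens served over the deployment

candidates = {
    "compute-optimal (70B, 1.4T tokens)": (70e9, 1.4e12),
    "overtrained, hypothetical (8B, 15T tokens)": (8e9, 15e12),
}
for name, (n_params, train_tokens) in candidates.items():
    total = lifetime_flops(n_params, train_tokens, served)
    share = inference_flops(n_params, served) / total
    print(f"{name}: lifetime ≈ {total:.2e} FLOPs, inference share ≈ {share:.0%}")
```

Under these two approximations, inference compute overtakes training compute once the model has served three times as many tokens as it was trained on, regardless of N.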

Mental Model

Loss is a smooth power law in N, D, C: a straight line on log-log axes, which is why small runs forecast big ones.
Chinchilla: scale parameters and tokens together, ≈ 20 tokens per parameter at the training-compute optimum.
Kaplan's "parameters über alles" came from a learning-rate schedule artifact — a reminder that scaling fits inherit their experiments' bugs.
If you serve the model, deliberately overtrain a smaller one; inference cost bends the optimum away from Chinchilla.
The law allocates budget; it does not promise capabilities.