LLMs

Pretraining

One dumb task, scaled until it stops being dumb

01 · First principles · Why next-token prediction, of all tasks?

Supervised learning needs labels, and labels are the expensive part. The first-principles question of pretraining is: what task has effectively infinite free labels and cannot be solved without understanding? Next-token prediction is the answer to both clauses at once. Every position in every document is a labelled example (the label is simply the next token), so the supervision is the corpus itself. And the task is unboundedly hard, because predicting the next token well recruits everything:

"The cat sat on the ___" — syntax and collocation.
"The capital of Mongolia is ___" — world knowledge.
"so x = 12, and therefore 3x + 1 = ___" — computation.
"and the murderer turned out to be ___" — having tracked an entire plot.

There is no ceiling at which the task is solved and learning stops; pushing loss lower keeps demanding more structure. That open-endedness, not any architectural insight, is the core bet of the LLM era.
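The claim that every position is a labelled example can be made concrete with a minimal sketch. The whitespace split standing in for a tokeniser and the function name `next_token_examples` are assumptions for illustration, not part of any real pipeline.

```python
def next_token_examples(token_ids):
    """Turn one token sequence into (context, target) training pairs.

    Position t yields an example whose label is simply token t,
    so no human annotation is involved.
    """
    return [(token_ids[:t], token_ids[t]) for t in range(1, len(token_ids))]


# Toy whitespace "tokeniser" (an assumption for illustration only).
tokens = "the cat sat on the mat".split()
examples = next_token_examples(tokens)

for context, target in examples:
    print(" ".join(context), "->", target)
```

A sequence of T tokens gives T − 1 examples for free. Real training code typically avoids materialising the pairs and instead uses the same array twice, shifted by one position.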

02 · The objective · Cross-entropy is compression

Training minimises the average negative log-probability assigned to the true next token:

L = −(1/T) Σ_t log₂ p_θ(x_t | x_<t) = bits per token to encode the corpus

(via arithmetic coding: literally, not metaphorically)

The equality on the right is literal: a model assigning probability p to the true token can encode it in −log₂ p bits with an arithmetic coder (up to a couple of bits of overhead across the whole sequence), so loss measured in bits is the compressed size of the data. This reframing earns its keep twice. It explains why the task forces understanding: the best compressor of text must exploit grammar, facts and logic, because regularity is the only thing compression can use. And it gives loss an absolute meaning: the gap between your loss and the true entropy of text is exactly the regularity not yet captured. A model that predicts is a model that compresses, and vice versa.
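As a small worked example of the loss-as-bits reading, the sketch below computes bits per token from the probabilities a model assigned to the true tokens. The probability values are hypothetical, chosen as powers of two so the arithmetic is easy to check by hand.

```python
import math


def ideal_code_length_bits(true_token_probs):
    """Total bits an ideal entropy coder would spend on the sequence."""
    return sum(-math.log2(p) for p in true_token_probs)


def bits_per_token(true_token_probs):
    """Average negative log2-probability given to the true tokens."""
    return ideal_code_length_bits(true_token_probs) / len(true_token_probs)


# Hypothetical probabilities a model assigned to four true next tokens.
probs = [0.5, 0.25, 0.125, 0.125]

total_bits = ideal_code_length_bits(probs)  # 1 + 2 + 3 + 3 bits
loss_bits = bits_per_token(probs)           # total divided by 4 tokens

# Frameworks usually report loss in nats (natural log);
# dividing by ln 2 converts nats to bits.
loss_nats = sum(-math.log(p) for p in probs) / len(probs)
loss_bits_from_nats = loss_nats / math.log(2)
```

Here the four tokens cost 9 bits in total, or 2.25 bits per token. A better model raises the probabilities on the true tokens, which shrinks both the loss and the compressed size.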

03 · What actually matters · Data over architecture

The uncomfortable empirical finding of the last few years: at fixed compute, architecture tweaks move the loss curve by slivers, while data curation moves it by chunks. Two interventions dominate.

Deduplication. Web crawls repeat themselves enormously, and a duplicated document is a wasted gradient step that also teaches memorisation (near-duplicates are worse, being harder to catch). Removing them improves loss-per-compute and reduces verbatim regurgitation — one of the few free lunches on record.
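A minimal sketch of both kinds of deduplication, assuming documents fit in memory as plain strings. Exact duplicates are caught by hashing normalised text; near-duplicates by Jaccard similarity over word 3-grams. The function names, the shingle size and the 0.7 threshold are hypothetical choices for illustration.

```python
import hashlib


def normalise(text):
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return " ".join(text.lower().split())


def exact_dedup(docs):
    """Keep the first copy of each document, comparing normalised hashes."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalise(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text, n=3):
    """Set of word n-grams used as a cheap fingerprint of a document."""
    words = normalise(text).split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def jaccard(a, b):
    """Jaccard similarity between the shingle sets of two documents."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def near_dedup(docs, threshold=0.7):
    """Greedy pairwise near-duplicate filter (threshold is hypothetical)."""
    kept = []
    for doc in docs:
        if all(jaccard(doc, k) < threshold for k in kept):
            kept.append(doc)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog",
    "the quick  brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the lazy dog today",
    "An entirely different sentence about pretraining data",
]
unique = near_dedup(exact_dedup(docs))
```

The pairwise comparison is quadratic in the number of documents, so it only illustrates the idea. At crawl scale it is usually replaced by an approximate scheme such as MinHash with locality-sensitive hashing.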

Mixture weights. The corpus is a portfolio (web text, books, code, papers, multiple languages), and the weights are capability decisions, not hygiene. Code-heavy mixtures measurably improve reasoning; the multilingual share sets multilingual quality, compounding the inequities already introduced by the tokeniser. Quality filtering, increasingly done by model-based classifiers rather than heuristics, decides what enters at all. None of this shows up in the architecture diagram, and all of it shows up in the model.
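One way to picture mixture weights is as a sampling distribution over sources. The sketch below splits a token budget and draws per-document sources accordingly. The source names and weight values are hypothetical, not a recommended recipe.

```python
import random

# Hypothetical mixture weights; real values are tuned empirically.
MIXTURE = {
    "web": 0.60,
    "code": 0.20,
    "books": 0.10,
    "papers": 0.05,
    "multilingual": 0.05,
}


def token_budget(mixture, total_tokens):
    """Split a total token budget across sources in proportion to the weights."""
    norm = sum(mixture.values())
    return {name: round(total_tokens * w / norm) for name, w in mixture.items()}


def sample_sources(mixture, n, seed=0):
    """Draw the source of each of n training documents according to the weights."""
    rng = random.Random(seed)
    names = list(mixture)
    return rng.choices(names, weights=[mixture[s] for s in names], k=n)


budget = token_budget(MIXTURE, 1_000_000_000_000)
batch_sources = sample_sources(MIXTURE, 10_000)
web_fraction = batch_sources.count("web") / len(batch_sources)
```

Changing one number in `MIXTURE` changes what the model sees for the whole run, which is why the weights act as capability decisions.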

Rule of thumb: teams iterate on data because, per unit of effort, it is where the loss is. The transformer recipe has been near-frozen since 2020 (pre-norm, RoPE, SwiGLU; see also MoE); the data pipeline has not stopped changing.

04 · The regime · Compute-bound, and what the loss curve says

Pretraining lives in a regime most of ML never visits: data is effectively unlimited (each token is seen about once), so there is no overfitting in the classical sense, and the binding constraint is compute. The train and validation curves nearly coincide; the question is never "will it generalise" but "how far down can the budget push the curve" — which is precisely the question scaling laws answer, prescribing ~20 tokens per parameter at the compute optimum (more if inference matters).
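The ~20 tokens per parameter heuristic turns a compute budget into a model size and a token count once a cost model is fixed. The sketch below assumes the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6·N·D). That approximation and the function name are assumptions not stated in the text above.

```python
import math

TOKENS_PER_PARAM = 20      # the ~20 tokens/parameter heuristic from the text
FLOPS_PER_PARAM_TOKEN = 6  # assumption: training cost C is about 6 * N * D


def compute_optimal_split(flops_budget, tokens_per_param=TOKENS_PER_PARAM):
    """Return (parameters N, tokens D) that spend the budget with D = r * N."""
    # C = 6 * N * D and D = r * N  =>  N = sqrt(C / (6 * r))
    n_params = math.sqrt(
        flops_budget / (FLOPS_PER_PARAM_TOKEN * tokens_per_param)
    )
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


n, d = compute_optimal_split(1e23)
print(f"{n:.2e} parameters, {d:.2e} tokens")
```

Under these assumptions a budget of 1e23 FLOPs lands at roughly 29 billion parameters and 580 billion tokens. Passing a larger `tokens_per_param` models the "more if inference matters" case: a smaller model trained on more tokens for the same budget.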

The canonical curve: a fast drop as the model learns token statistics, then a long power-law slide where each increment of quality costs multiplicatively more tokens. Spikes are watched obsessively; a diverged run at this scale is unrecoverable money.
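To see why each increment of quality costs multiplicatively more tokens, the sketch below inverts a power-law curve of the form L(D) = E + A·D^(−α). The functional form is a common modelling choice, and the constants E, A and α are hypothetical values picked for illustration, not fitted to any real run.

```python
def loss_at(tokens, irreducible=1.7, scale=400.0, alpha=0.3):
    """Hypothetical power-law loss curve L(D) = E + A * D**(-alpha)."""
    return irreducible + scale * tokens ** (-alpha)


def tokens_for_loss(target, irreducible=1.7, scale=400.0, alpha=0.3):
    """Invert the curve: tokens needed to reach a target loss above E."""
    if target <= irreducible:
        raise ValueError("target is at or below the irreducible loss")
    return (scale / (target - irreducible)) ** (1.0 / alpha)


# Halving the reducible loss multiplies the token cost by 2 ** (1 / alpha).
d1 = tokens_for_loss(2.1)  # reducible loss 0.4
d2 = tokens_for_loss(1.9)  # reducible loss 0.2
cost_ratio = d2 / d1
```

With α = 0.3 the ratio is 2^(1/0.3), about ten times the tokens for each halving of the remaining reducible loss. The smaller the exponent, the steeper that price.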

05 · The boundary · What pretraining does not produce

The output of pretraining is a simulator of its corpus, not an assistant. Ask it a question and it may continue with three more questions, because that is a likely continuation of text containing a question. Capabilities are present but not elicited — getting from "predicts documents" to "answers helpfully" is the job of finetuning and RLHF, which reshape behaviour while adding almost no new knowledge. Nearly everything the final model knows, it learned here.

Mental Model

Next-token prediction is the one task with infinite free labels and no ceiling — predicting well eventually requires syntax, knowledge and reasoning.
Cross-entropy loss is literally bits-per-token: training a predictor and building a compressor are the same act.
At fixed compute, dedup and mixture weights move loss more than architecture changes; data is where iteration pays.
The regime is compute-bound with single-epoch data: there is no classical overfitting, and scaling laws set the budget split.
Pretraining stores the knowledge; everything afterwards only shapes how it comes out.