One dumb task, scaled until it stops being dumb
Supervised learning needs labels, and labels are the expensive part. The first-principles question of pretraining is: what task has effectively infinite free labels and cannot be solved without understanding? Next-token prediction is the answer to both clauses at once. Every position in every document is a labelled example (the label is simply the next token), so the supervision is the corpus itself. And the task is unboundedly hard, because predicting the next token well recruits everything the text depends on, from grammar to facts to logic.
There is no ceiling at which the task is solved and learning stops; pushing loss lower keeps demanding more structure. That open-endedness, not any architectural insight, is the core bet of the LLM era.
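To make concrete the claim that every position is a labelled example, here is a minimal sketch. The whitespace "tokeniser" and the helper name `next_token_examples` are assumptions for illustration only; real pipelines use subword tokens and fixed-length context windows.

```python
def next_token_examples(tokens):
    """Yield (context, label) pairs: the label at each position is the next token."""
    for t in range(1, len(tokens)):
        yield tokens[:t], tokens[t]

# Toy whitespace tokenisation (an assumption for illustration only).
tokens = "the cat sat on the mat".split()
examples = list(next_token_examples(tokens))

for context, label in examples:
    print(" ".join(context), "->", label)
```

A sequence of N tokens yields N - 1 training examples, and none of them required an annotator.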
Training minimises the average negative log-probability assigned to the true next token:
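Written out, the objective and its coding interpretation can be stated as follows. The notation is an assumption made here: x_1, ..., x_N are the corpus tokens, p_θ is the model, and the logarithm is taken base 2 so that the loss is measured in bits.

```latex
\mathcal{L}(\theta)
  = -\frac{1}{N} \sum_{t=1}^{N} \log_2 p_\theta\!\left(x_t \mid x_{<t}\right)
  = \frac{\text{compressed size of the corpus in bits}}{N}
```

With natural logarithms the same quantity is measured in nats, which differs only by a constant factor of ln 2.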
The equality on the right is literal: a model assigning probability p to the true token can encode it in −log₂ p bits with an arithmetic coder, so loss is the compressed size of the data. This reframing earns its keep twice. It explains why the task forces understanding (the best compressor of text must exploit grammar, facts and logic — regularity is the only thing compression can use), and it gives loss an absolute meaning: the gap between your loss and the true entropy of text is exactly the regularity not yet captured. A model that predicts is a model that compresses, and vice versa.
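The link between loss and code length can be checked numerically. The sketch below is a toy under stated assumptions: the "model" is a character-level unigram distribution fit to the very string it scores, and the small constant overhead of a real arithmetic coder is ignored. The function names are hypothetical.

```python
import math
from collections import Counter

def unigram_model(text):
    """Toy 'model': each character's probability is its relative frequency."""
    counts = Counter(text)
    total = len(text)
    return {ch: c / total for ch, c in counts.items()}

def code_length_bits(text, probs):
    """Ideal code length: -log2 p(token) bits per token, summed over the text."""
    return sum(-math.log2(probs[ch]) for ch in text)

text = "abracadabra"
probs = unigram_model(text)

total_bits = code_length_bits(text, probs)
loss_bits_per_token = total_bits / len(text)   # average negative log2-probability
uniform_bits_per_token = math.log2(len(probs)) # baseline: every symbol equally likely

print(f"model:   {loss_bits_per_token:.3f} bits/char")
print(f"uniform: {uniform_bits_per_token:.3f} bits/char")
```

The model's loss comes in below the uniform baseline because it exploits one regularity (some characters are more frequent than others). A better predictor would exploit more of them and compress further.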
The uncomfortable empirical finding of the last few years: at fixed compute, architecture tweaks move the loss curve by slivers, while data curation moves it by chunks. Two interventions dominate.
Deduplication. Web crawls repeat themselves enormously, and a duplicated document is a wasted gradient step that also teaches memorisation (near-duplicates are worse, being harder to catch). Removing them improves loss-per-compute and reduces verbatim regurgitation — one of the few free lunches on record.
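As a minimal sketch of the exact-match case, documents can be hashed after light normalisation and only the first occurrence kept. The normalisation rule and function names are assumptions for illustration; catching near-duplicates generally needs fuzzy techniques (for example MinHash over shingles), which this sketch does not attempt.

```python
import hashlib

def normalise(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def deduplicate(documents):
    """Keep the first occurrence of each document, compared after normalisation."""
    seen = set()
    kept = []
    for doc in documents:
        digest = hashlib.sha256(normalise(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = [
    "The quick brown fox.",
    "the quick  brown fox.",       # same text, different case and spacing
    "The quick brown fox jumps.",  # near-duplicate: not caught by exact hashing
]
unique_docs = deduplicate(corpus)
print(unique_docs)
```

The third document survives, which illustrates why near-duplicates are the harder problem.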
Mixture weights. The corpus is a portfolio (web text, books, code, papers, multiple languages), and the weights are capability decisions, not hygiene. Code-heavy mixtures measurably improve reasoning; the multilingual share sets multilingual quality, compounding the inequities already introduced by the tokeniser. Quality filtering, increasingly done by model-based classifiers rather than heuristics, decides what enters at all. None of this shows up in the architecture diagram, and all of it shows up in the model.
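One simple way to realise a mixture is to choose the source of each training document by weighted sampling. The sketch below assumes that approach; the source names and weights are hypothetical and are not a recommended recipe.

```python
import random
from collections import Counter

# Hypothetical mixture weights; the numbers are illustrative only.
mixture = {"web": 0.5, "code": 0.2, "books": 0.15, "papers": 0.1, "multilingual": 0.05}

def sample_sources(mixture, n, seed=0):
    """Draw n source names with probability proportional to their weight."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

draws = sample_sources(mixture, n=10_000)
shares = {name: count / len(draws) for name, count in Counter(draws).items()}

for name, weight in mixture.items():
    print(f"{name:>12}: target {weight:.2f}, sampled {shares.get(name, 0.0):.3f}")
```

Changing a single number in `mixture` changes what the model sees for the whole run, which is why the weights behave like capability decisions.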
Pretraining lives in a regime most of ML never visits: data is effectively unlimited (each token is seen about once), so there is no overfitting in the classical sense, and the binding constraint is compute. The train and validation curves nearly coincide; the question is never "will it generalise" but "how far down can the budget push the curve". That is precisely the question scaling laws answer, prescribing roughly 20 tokens per parameter at the compute optimum (more when inference cost matters, since a smaller model trained on more tokens is cheaper to serve).
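The 20-tokens-per-parameter rule can be turned into a rough sizing calculation. The sketch below adds one assumption that is not in the text: the common approximation that training costs about 6 FLOPs per parameter per token. The budget value is hypothetical.

```python
import math

TOKENS_PER_PARAM = 20      # the ~20 tokens/parameter rule of thumb from the text
FLOPS_PER_PARAM_TOKEN = 6  # assumption: training cost C is roughly 6 * N * D

def compute_optimal(flops_budget):
    """Return (parameters N, tokens D) with D = 20 * N and 6 * N * D = budget."""
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

budget = 1e23  # hypothetical training budget in FLOPs
n_params, n_tokens = compute_optimal(budget)
print(f"{n_params:.2e} parameters, {n_tokens:.2e} tokens")
```

Because both N and D scale with the square root of the budget under these assumptions, a 100x larger budget buys only a 10x larger model.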
The canonical curve: a fast drop as the model learns token statistics, then a long power-law slide where each increment of quality costs multiplicatively more tokens. Spikes are watched obsessively; a diverged run at this scale is unrecoverable money.
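The "multiplicatively more tokens" behaviour follows directly from a power law. The sketch below assumes a curve of the form L(D) = E + A * D^(-alpha); the constants E, A and alpha are hypothetical placeholders, not fitted values.

```python
# Hypothetical constants for L(D) = E + A * D**(-ALPHA); not fitted to any real run.
E = 1.7      # irreducible loss
A = 400.0    # scale of the reducible term
ALPHA = 0.3  # power-law exponent

def loss(tokens):
    """Loss after training on `tokens` tokens under the assumed power law."""
    return E + A * tokens ** (-ALPHA)

def excess(tokens):
    """Reducible part of the loss, L(D) - E."""
    return A * tokens ** (-ALPHA)

def token_multiplier_to_halve_excess(alpha=ALPHA):
    """Factor by which D must grow to halve the reducible loss."""
    return 2.0 ** (1.0 / alpha)

d0 = 1e9
factor = token_multiplier_to_halve_excess()
print(f"halving the reducible loss needs {factor:.1f}x more tokens")
print(f"excess at D0: {excess(d0):.4f}, at {factor:.1f} * D0: {excess(d0 * factor):.4f}")
```

The multiplier depends only on the exponent, not on where the run currently sits on the curve, so every further halving costs the same factor again.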
The output of pretraining is a simulator of its corpus, not an assistant. Ask it a question and it may continue with three more questions, because that is a likely continuation of text containing a question. Capabilities are present but not elicited — getting from "predicts documents" to "answers helpfully" is the job of finetuning and RLHF, which reshape behaviour while adding almost no new knowledge. Nearly everything the final model knows, it learned here.