
Entropy

Surprise, averaged — and why your training loss is a code length

01 · First principles: How surprised should an event make you?

Start with one event of probability p. We want a number s(p) measuring how surprising it is when it happens. Three requirements pin the answer down almost uniquely:

  1. Certain events carry no surprise: s(1) = 0.
  2. Rarer means more surprising: s decreases in p, blowing up as p → 0.
  3. Independent surprises add: two unrelated events of probabilities p and q occur together with probability pq, and the joint surprise should be s(p) + s(q).

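Requirement 3 is the decisive one: the only continuous functions that turn products into sums are multiples of the logarithm. Requirement 2 then fixes the sign (requirement 1 comes for free), leaving only the base, which sets the unit. So: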

s(p) = −log p   — base 2 gives bits, base e gives nats

A fair-coin head is 1 bit of surprise. A 1-in-1024 event is 10 bits. Each halving of the probability adds one bit; surprise is the receipt randomness hands you, denominated in logs.
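The numbers above can be checked with a minimal sketch. The function name `surprisal` and its `base` parameter are illustrative choices, not something the text defines.

```python
import math

def surprisal(p: float, base: float = 2.0) -> float:
    """Surprise -log_base(p) of an event with probability p, for 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if base == 2.0:
        return -math.log2(p)              # bits; exact for powers of two
    return -math.log(p) / math.log(base)  # any other unit, e.g. base e for nats

print(surprisal(0.5))        # fair-coin head
print(surprisal(1 / 1024))   # 1-in-1024 event

# Requirement 3: for independent events the joint surprise is the sum.
p, q = 0.25, 0.125
print(surprisal(p * q), surprisal(p) + surprisal(q))
```

Passing `base=math.e` gives the same quantities in nats, which is the unit most training losses are reported in.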

02 · The definition: Entropy = expected surprise

A distribution produces events all the time, each with its own surprise. Average them under the distribution itself (an expectation, like everything else):

H(p) = E_{x∼p}[−log p(x)] = −Σ_x p(x) log p(x)
how surprising this source is, on average

Shannon's source coding theorem gives the operational meaning in one line: with base-2 logs, H(p) is the lower bound on the average number of bits per symbol that any lossless code can achieve for data drawn from p. Frequent symbols get short codewords, rare ones get long codewords (ideally −log p(x) bits each, rounded up to a whole number when symbols are coded one at a time), and no cleverness beats the average. Entropy is not a metaphor for information; it is the irreducible invoice for transmitting it.
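As a rough illustration of the coding claim, the sketch below computes entropy and the whole-bit codeword lengths ceil(−log2 p(x)) for a small source. The four-symbol distribution is a made-up example, chosen so that every ideal length is already a whole number of bits; the helper names are assumptions too.

```python
import math

def entropy_bits(probs):
    """H(p) = -sum_x p(x) log2 p(x), treating 0 log 0 as 0."""
    return sum(p * -math.log2(p) for p in probs if p > 0)

def shannon_code_lengths(probs):
    """Whole-bit codeword lengths ceil(-log2 p(x)), one per symbol."""
    return [math.ceil(-math.log2(p)) for p in probs]

# Hypothetical source with power-of-two probabilities.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = shannon_code_lengths(probs)

# Average code length paid per symbol under the source's own frequencies.
avg_len = sum(p * l for p, l in zip(probs, lengths))

# Kraft sum: a prefix code with these lengths exists when this is <= 1.
kraft = sum(2.0 ** -l for l in lengths)

print("lengths:", lengths)
print("average length:", avg_len, "entropy:", entropy_bits(probs))
print("Kraft sum:", kraft)
```

For this source the average length meets the entropy exactly. For probabilities that are not powers of two, rounding up costs something, but the average stays below H(p) + 1 bit.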

03 · The extremes: Uniform maximises, deterministic zeroes

A deterministic source — one outcome, probability 1 — has H = 0: nothing to say, nothing to transmit. At the other end, the uniform distribution over K outcomes maximises entropy at log K: every guess is as bad as every other, maximal ignorance. Everything interesting lives between.

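[Figure: binary entropy H (bits) plotted against P(heads) from 0 to 1. The curve peaks at 1 bit for the fair coin (maximum uncertainty); as the coin becomes loaded it becomes predictable and H → 0.]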

Binary entropy. Certain at either end (H = 0), maximally uncertain at p = 0.5 (H = 1 bit).
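The shape of the curve and the two extremes can be reproduced in a few lines. This is a sketch with illustrative names; K = 8 is an arbitrary choice for the uniform case.

```python
import math

def entropy_bits(probs):
    """H(p) in bits, treating 0 log 0 as 0."""
    return sum(p * -math.log2(p) for p in probs if p > 0)

def binary_entropy(p):
    """Entropy of a coin that lands heads with probability p."""
    return entropy_bits([p, 1.0 - p])

# Walk along the curve: zero at both ends, peak in the middle.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"P(heads) = {p:.1f}   H = {binary_entropy(p):.3f} bits")

# Uniform over K outcomes reaches the maximum, log2 K.
K = 8
uniform = [1.0 / K] * K
print(entropy_bits(uniform), math.log2(K))
```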

This curve is why label smoothing, exploration bonuses, and entropy regularisation all reach for the same dial: pushing a policy or a softmax toward the top of the curve keeps options open, while pushing toward the ends commits.
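To make the dial concrete for one of these techniques, here is a small sketch of label smoothing in its common form, (1 − ε) · one_hot + ε / K. The function name `smooth_labels` and the ε values are illustrative assumptions; the point is only that the target's entropy rises from 0 toward log2 K as ε grows.

```python
import math

def entropy_bits(probs):
    """H(p) in bits, treating 0 log 0 as 0."""
    return sum(p * -math.log2(p) for p in probs if p > 0)

def smooth_labels(one_hot, eps):
    """Mix a one-hot target with the uniform distribution over its K classes."""
    k = len(one_hot)
    return [(1.0 - eps) * y + eps / k for y in one_hot]

one_hot = [0.0, 1.0, 0.0, 0.0]   # hypothetical 4-class label
for eps in (0.0, 0.1, 0.5, 1.0):
    target = smooth_labels(one_hot, eps)
    print(f"eps = {eps:.1f}   target entropy = {entropy_bits(target):.3f} bits")
```

At ε = 0 the target is fully committed (H = 0); at ε = 1 it is uniform and sits at the top of the curve.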

04 · The wrong codebook: Cross-entropy is your training loss

Suppose data comes from p but you built your code (your model) for q. You pay q's codeword lengths, −log q(x), at p's frequencies:

H(p, q) = E_{x∼p}[−log q(x)] = H(p) + KL(p ∥ q)
H(p): unavoidable · KL(p ∥ q): penalty for the wrong book
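The split into two terms follows in one step from adding and subtracting log p(x). The derivation below assumes the standard definition KL(p ∥ q) = Σ_x p(x) log(p(x)/q(x)), which this note uses but does not spell out.

```latex
\begin{aligned}
H(p, q) &= -\sum_x p(x)\,\log q(x) \\
        &= -\sum_x p(x)\,\log p(x) \;+\; \sum_x p(x)\,\log\frac{p(x)}{q(x)} \\
        &= H(p) + \mathrm{KL}(p \,\|\, q)
\end{aligned}
```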

This is exactly the loss a classifier trains on: the labels are samples from p (usually one-hot), the softmax output is q, and minimising −log q(label) is minimising the average code length your model assigns the truth. Since H(p) is fixed by the data, minimising cross-entropy is minimising the KL divergence — the wrong-book penalty is the only part you can reduce.
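A minimal numerical sketch of this paragraph, in plain Python rather than any particular framework. The logits, the label, and the soft target `p_soft` are made-up values, and the helper names are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x), in nats."""
    return sum(px * -math.log(qx) for px, qx in zip(p, q) if px > 0)

def entropy(p):
    """H(p) is the cross-entropy of p with itself."""
    return cross_entropy(p, p)

def kl(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x)), in nats."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

logits = [2.0, 0.5, -1.0]          # hypothetical model outputs for 3 classes
q = softmax(logits)
label = 0
one_hot = [1.0 if i == label else 0.0 for i in range(len(q))]

# One-hot target: H(p) = 0, so the loss, H(p, q), and KL(p || q) coincide.
loss = -math.log(q[label])
print(loss, cross_entropy(one_hot, q), kl(one_hot, q))

# Soft target: H(p) > 0 is a floor that no model q can get under.
p_soft = [0.7, 0.2, 0.1]
print(cross_entropy(p_soft, q), entropy(p_soft) + kl(p_soft, q))
```

With a one-hot label the per-example loss can in principle reach zero. With a genuinely noisy target the best achievable value is H(p), reached only when q = p.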

Perplexity, the language-modelling metric, is just the exponential of the per-token cross-entropy, exp(H(p, q)) with H in nats (equivalently 2^H with H in bits): the effective number of equally likely choices the model is still hedging between. A perplexity of 20 means the model is, on average, as uncertain as a fair 20-sided die.
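In practice the cross-entropy is estimated from held-out text as the mean negative log-probability the model assigned to each observed token. The sketch below assumes exactly that estimate; the probability lists are invented for illustration.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability (nats) of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that always gives the true token probability 1/20
# behaves like a fair 20-sided die.
print(perplexity([1 / 20] * 50))

# Hypothetical per-token probabilities from a less uniform model.
print(perplexity([0.5, 0.1, 0.05, 0.2]))
```

The choice of log base does not matter as long as the exponentiation matches it: averaging in bits and raising 2 to that power gives the same number.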

05 · Orientation: Where each quantity points

| Quantity | Question it answers | Lives in |
| --- | --- | --- |
| −log p(x) | How surprising was this one event? | Per-sample loss values |
| H(p) | How unpredictable is this source, irreducibly? | Noise floor; best achievable loss |
| H(p, q) | What do I pay using model q on data p? | The training objective |
| KL(p ∥ q) | How much of that payment was avoidable? | Its own note |