The long-run average, and why every loss function is one
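A random variable X takes many values with many probabilities. If you could only keep one number to stand for it, which one? The natural candidate: weight each value by how often it occurs. This is the expectation, E[X] = Σ x·p(x) summed over the values x (or ∫ x·p(x) dx for a continuous density).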
The probability-weighted average is also the long-run value: play the game n times independently, average the outcomes, and as n grows the empirical average converges to E[X] (the law of large numbers, which requires E[X] to exist). A casino does not know what your next roll pays; it knows almost exactly what a million rolls pay per roll. Expectation is the casino's view of randomness.
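The convergence is easy to watch. Below is a minimal sketch, assuming a fair six-sided die as the game (so E[X] = 3.5); the function name `running_average_of_die` and the checkpoint sizes are illustrative choices, not anything fixed by the text.

```python
import random

def running_average_of_die(n_rolls, seed=0):
    """Roll a fair die n_rolls times and record the running average at a few checkpoints."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    checkpoints = {10, 1_000, 100_000, n_rolls}
    total = 0
    averages = {}
    for i in range(1, n_rolls + 1):
        total += rng.randint(1, 6)  # one roll, uniform on {1, ..., 6}
        if i in checkpoints:
            averages[i] = total / i
    return averages

averages = running_average_of_die(200_000)
for n, avg in sorted(averages.items()):
    print(f"n={n:>7}  average={avg:.4f}  (E[X] = 3.5)")
```

The early averages wander; the late ones should sit close to 3.5, with the gap shrinking roughly like 1/√n.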
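Almost every property in probability comes with conditions. Linearity of expectation comes with essentially none: for any random variables X and Y with finite expectations, dependent or not, and any constants a and b,
E[aX + bY] = a·E[X] + b·E[Y]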
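This is the fact everyone forgets. Variances add only when the variables are uncorrelated (independence is sufficient); expectations add unconditionally, because integration is linear and dependence lives in the joint density, which the sum never has to inspect. We can verify in two lines:
E[X + Y] = ∫∫ (x + y)·p(x, y) dx dy = ∫∫ x·p(x, y) dx dy + ∫∫ y·p(x, y) dx dy
= ∫ x·p_X(x) dx + ∫ y·p_Y(y) dy = E[X] + E[Y]
Integrating out the other variable in each term leaves only the marginals p_X and p_Y, so the joint structure never matters.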
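An extreme case makes the contrast concrete. The sketch below assumes a fair die roll X and sets Y = 7 − X, so Y is completely determined by X; the helper names `expect` and `var` are illustrative. Exact fractions are used so nothing is hidden by sampling noise.

```python
from fractions import Fraction

# Joint distribution of a fair die roll X and Y = 7 - X (Y is fully dependent on X).
joint = {(x, 7 - x): Fraction(1, 6) for x in range(1, 7)}

def expect(f):
    """Exact expectation of f(x, y) under the joint distribution."""
    return sum(p * f(x, y) for (x, y), p in joint.items())

def var(f):
    """Exact variance of f(x, y) under the joint distribution."""
    m = expect(f)
    return expect(lambda x, y: (f(x, y) - m) ** 2)

e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)
e_sum = expect(lambda x, y: x + y)

var_x = var(lambda x, y: x)
var_y = var(lambda x, y: y)
var_sum = var(lambda x, y: x + y)

print("E[X+Y] =", e_sum, " E[X]+E[Y] =", e_x + e_y)              # both 7
print("Var(X+Y) =", var_sum, " Var(X)+Var(Y) =", var_x + var_y)  # 0 versus 35/6
```

The expectations agree even though X and Y could hardly be more dependent, while the variances disagree completely: X + Y is the constant 7, so its variance is zero.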
The trick in practice: decompose a hard random quantity into a sum of easy indicator variables, take expectations one by one, and never once think about how the pieces interact. (Counting expected collisions in a hash table, expected triangles in a random graph — all the same move.)
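As a sketch of the indicator move on the hash-table example: with n keys hashed uniformly into m buckets, each of the C(n, 2) pairs of keys collides with probability 1/m, so the expected number of colliding pairs is C(n, 2)/m, with no need to reason about how the pairs overlap. The code assumes idealised uniform hashing, and the function names are hypothetical; the simulation is only there as a sanity check on the formula.

```python
import random
from collections import Counter
from math import comb

def expected_colliding_pairs(n_keys, n_buckets):
    """Sum of E[indicator] over all pairs: each pair collides with probability 1/n_buckets."""
    return comb(n_keys, 2) / n_buckets

def simulate_colliding_pairs(n_keys, n_buckets, trials, seed=0):
    """Monte Carlo estimate of the same quantity under uniform random hashing."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        counts = Counter(rng.randrange(n_buckets) for _ in range(n_keys))
        # A bucket holding c keys contributes C(c, 2) colliding pairs.
        total += sum(comb(c, 2) for c in counts.values())
    return total / trials

formula = expected_colliding_pairs(20, 50)
simulated = simulate_colliding_pairs(20, 50, trials=20_000)
print(f"formula: {formula:.3f}  simulated: {simulated:.3f}")
```

The pair indicators are far from independent of one another (three keys in one bucket produce three colliding pairs at once), yet the formula never has to account for that.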
What we actually want to minimise in ML is the risk, the expected loss of the model fθ over the true data distribution:
R(θ) = E_{(x,y)∼pdata}[ℓ(fθ(x), y)]
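We cannot compute this expectation (we do not have pdata), so we estimate it with samples, averaging the loss of the model fθ over n examples (xᵢ, yᵢ) drawn from pdata:
R̂(θ) = (1/n) Σᵢ ℓ(fθ(xᵢ), yᵢ) ≈ E[ℓ(fθ(x), y)]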
That is the entire justification for training on a dataset, in one line. A mini-batch gradient is the same move applied to a gradient: an unbiased Monte Carlo estimate of ∇θE[ℓ]. SGD works because the expectation of the noisy gradient is the true gradient — linearity again, doing quiet load-bearing work.
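The unbiasedness claim can be checked exactly on a toy problem. The sketch below assumes a one-parameter linear model w·x with squared-error loss, a made-up four-point dataset, and mini-batches of size 2 drawn uniformly without replacement; under those assumptions, averaging the mini-batch gradient over every possible batch recovers the full-dataset gradient.

```python
import itertools
import math

# Hypothetical toy dataset of (x, y) pairs and a current parameter value.
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 8.0)]
w = 1.0

def grad_single(w, x, y):
    """Gradient with respect to w of the squared error (w*x - y)**2."""
    return 2.0 * (w * x - y) * x

def grad_batch(w, batch):
    """Average gradient over a batch of examples."""
    return sum(grad_single(w, x, y) for x, y in batch) / len(batch)

full_gradient = grad_batch(w, data)

# Enumerate every size-2 mini-batch; each is equally likely under uniform sampling.
batches = list(itertools.combinations(data, 2))
batch_gradients = [grad_batch(w, b) for b in batches]
mean_batch_gradient = sum(batch_gradients) / len(batch_gradients)

print("full gradient:          ", full_gradient)
print("mean of batch gradients:", mean_batch_gradient)
print("individual batches:     ", batch_gradients)
```

Individual batches are noisy, but their average is the full gradient. Note that this shows unbiasedness for the gradient of the empirical risk on this dataset; the same argument with fresh draws from pdata gives unbiasedness for the true risk.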
Suppose you observe X and must predict Y with some function g(X), scored by mean squared error. Which g is optimal? Not a modelling choice but a theorem: the minimiser of E[(Y − g(X))²] is the conditional mean,
g*(x) = E[Y | X = x]
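The key step: fix x, write c = g(x), and expand around the conditional mean μ = E[Y|x]:
E[(Y − c)² | x] = E[(Y − μ)² | x] + 2(μ − c)·E[Y − μ | x] + (μ − c)² = Var(Y|x) + (μ − c)²
The cross term vanishes because E[Y − μ | x] = 0, so the error is smallest at c = μ, where it equals Var(Y|x).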
So every regression model is an attempt to approximate E[Y|X], and the leftover Var(Y|x) is exactly the noise floor in the bias–variance decomposition. (Swap MSE for absolute error and the answer becomes the conditional median; the loss chooses the summary.)
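A brute-force check of "the loss chooses the summary": the sketch below takes a small made-up sample with one outlier, scans a grid of candidate constants c, and compares the minimiser of each loss with the sample mean and median. The data and grid are illustrative assumptions.

```python
import statistics

ys = [1, 2, 3, 4, 20]                       # hypothetical sample with one outlier
candidates = [i / 2 for i in range(0, 41)]  # grid 0.0, 0.5, ..., 20.0

def mean_squared_error(c):
    return sum((y - c) ** 2 for y in ys) / len(ys)

def mean_absolute_error(c):
    return sum(abs(y - c) for y in ys) / len(ys)

best_for_squared = min(candidates, key=mean_squared_error)
best_for_absolute = min(candidates, key=mean_absolute_error)

print("argmin squared error: ", best_for_squared, " mean:  ", statistics.mean(ys))
print("argmin absolute error:", best_for_absolute, " median:", statistics.median(ys))
```

The squared-error minimiser lands on the mean (6) and is pulled toward the outlier; the absolute-error minimiser lands on the median (3) and ignores it.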
Two standard failure modes. First, for skewed distributions the mean sits far from where the mass is: income, token frequencies, loss spikes. Second, some distributions have no expectation at all: for the Cauchy distribution ∫ |x|·p(x) dx diverges, so E[X] is undefined, and sample averages never settle, no matter how many samples you take. (The average of n standard Cauchy samples is itself standard Cauchy, so averaging buys nothing.)
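The second failure mode shows up clearly in simulation. This is a minimal sketch, assuming standard Cauchy samples generated by the inverse-CDF formula tan(π(U − 1/2)) with U uniform on [0, 1), compared against standard normal samples; the helper `running_means` and the checkpoint sizes are illustrative.

```python
import math
import random
import statistics

def running_means(samples, checkpoints):
    """Return {n: mean of the first n samples} for each n in checkpoints."""
    wanted = set(checkpoints)
    means = {}
    total = 0.0
    for i, s in enumerate(samples, start=1):
        total += s
        if i in wanted:
            means[i] = total / i
    return means

rng = random.Random(0)
n = 100_000
checkpoints = [100, 1_000, 10_000, 100_000]

normal = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Inverse-CDF sampling of a standard Cauchy variable.
cauchy = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]

normal_means = running_means(normal, checkpoints)
cauchy_means = running_means(cauchy, checkpoints)
cauchy_median = statistics.median(cauchy)

for k in checkpoints:
    print(f"n={k:>7}  normal mean={normal_means[k]: .4f}  cauchy mean={cauchy_means[k]: .4f}")
print(f"cauchy median = {cauchy_median: .4f}")
```

The normal running mean tightens around 0 as n grows. The Cauchy running mean has no such tendency, since a single huge draw can move it at any point, while the Cauchy median stays near 0: the location is there, the mean just cannot find it.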
A right-skewed density. The mean is dragged toward the tail; most samples land left of it.
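The same picture in numbers, as a rough sketch: a lognormal distribution with parameters (0, 1) stands in here for a generic right-skewed density (an assumption for illustration, not necessarily the density in the figure).

```python
import random
import statistics

rng = random.Random(0)
samples = [rng.lognormvariate(0.0, 1.0) for _ in range(100_000)]

mean = statistics.fmean(samples)
median = statistics.median(samples)
# Fraction of samples that fall to the left of the mean.
frac_below_mean = sum(s < mean for s in samples) / len(samples)

print(f"mean={mean:.3f}  median={median:.3f}  fraction below mean={frac_below_mean:.3f}")
```

For this distribution the theoretical mean is e^0.5 ≈ 1.65 against a median of 1, so well over half of the samples should land below the mean.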