PDF / PMF

Probability mass, probability density, and the trap between them

01 · The easy case · Discrete: probability sits on points

For a discrete variable, probability is straightforward bookkeeping. The probability mass function assigns each outcome a weight directly:

p(x) = P(X = x), p(x) ≥ 0, Σ_x p(x) = 1

Every number you read off a PMF is a genuine probability. A die: p(3) = 1/6, full stop. Nothing here will surprise you, which is exactly why the continuous case does.
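The bookkeeping can be checked in a few lines. This is a minimal sketch for the fair die mentioned above; the names `pmf`, `p_three`, and `p_even` are illustrative choices, not part of any library.

```python
from fractions import Fraction

# PMF of a fair six-sided die: every face gets weight 1/6.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# Each value is a genuine probability in [0, 1] ...
all_valid = all(0 <= p <= 1 for p in pmf.values())

# ... and the weights sum to exactly 1 (exact arithmetic via Fraction).
total = sum(pmf.values())

# Probabilities of events are plain sums over outcomes.
p_three = pmf[3]
p_even = sum(p for face, p in pmf.items() if face % 2 == 0)

print(all_valid, total, p_three, p_even)
```

Using `Fraction` rather than floats keeps the sum exact, so the normalisation check is an equality and not an approximation.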

02 · The trap · Continuous: P(X = x) = 0, for every x

Now let X be a person's height. What is the probability that someone is exactly 170 cm, to infinitely many decimal places? Zero. Any interval contains uncountably many real numbers; if each of them carried positive probability, the total would explode past 1 (at most countably many points can carry positive mass, and a variable with a density puts mass on none). So for every continuous X and every point x:

P(X = x) = 0 — yet X must land somewhere

The fix is to stop asking about points and start asking about intervals. The probability density function p(x) is defined so that probability is its integral:

P(a ≤ X ≤ b) = ∫_a^b p(x) dx, ∫_−∞^∞ p(x) dx = 1

A density is probability per unit length — the same relationship a material's density has to its mass. Asking "what is the probability at this point" is asking "what is the mass of this point": zero. Only the integral over a region means anything. The map analogy holds up: population density at one GPS coordinate is well defined, but the population of a coordinate is zero; you must integrate over an area to count people.
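To see "points have no mass, intervals do" numerically, here is a small sketch. It assumes, purely for illustration, that height is Normal with mean 170 cm and standard deviation 7 cm; those two numbers and the function names are hypothetical. Interval probabilities come from the normal CDF, written with `math.erf`.

```python
import math

MU, SIGMA = 170.0, 7.0  # assumed height distribution in cm (illustrative values)

def normal_pdf(x, mu=MU, sigma=SIGMA):
    """Density p(x): a rate per cm, not a probability."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=MU, sigma=SIGMA):
    """P(X <= x), expressed through the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def interval_prob(a, b):
    """P(a <= X <= b), i.e. the integral of the density over [a, b]."""
    return normal_cdf(b) - normal_cdf(a)

# Shrink an interval centred on 170 cm and watch its probability vanish.
widths = [10.0, 1.0, 0.1, 0.001]
probs = [interval_prob(170.0 - w / 2, 170.0 + w / 2) for w in widths]

for w, p in zip(widths, probs):
    print(f"width {w:>6} cm: P = {p:.6f}   P / width = {p / w:.6f}")
print(f"density at 170 cm: {normal_pdf(170.0):.6f}")
```

As the width shrinks, the probability heads to zero while probability divided by width settles at the density, which is the "per unit length" reading in action.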

03 · Consequence · Densities can exceed 1

Because p(x) is a rate, not a probability, nothing caps it at 1. The uniform density on [0, 0.1] equals 10 everywhere on its support; a Gaussian with σ = 0.01 peaks at 1/(σ√(2π)) ≈ 39.9. The only constraint is that the area equals 1: tall is fine if narrow.

Figure: a PMF (left) beside a PDF (right). Left: each bar height is a probability. Right: heights are rates; the shaded area is the probability.
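Both examples can be verified directly. The sketch below evaluates the two densities and integrates them with a simple midpoint rule; `midpoint_integral` is a hand-rolled helper written for this illustration, not a library function.

```python
import math

def uniform_pdf(x, lo=0.0, hi=0.1):
    """Uniform density on [lo, hi]: constant height 1 / (hi - lo)."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

def gaussian_pdf(x, mu=0.0, sigma=0.01):
    """Gaussian density; the peak height is 1 / (sigma * sqrt(2 * pi))."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def midpoint_integral(f, a, b, n=100_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

uniform_height = uniform_pdf(0.05)   # far above 1
gaussian_peak = gaussian_pdf(0.0)    # also far above 1

# The areas are what must equal 1, not the heights.
uniform_area = midpoint_integral(uniform_pdf, 0.0, 0.1)
gaussian_area = midpoint_integral(gaussian_pdf, -0.1, 0.1)  # +/- 10 sigma

print(f"uniform height {uniform_height:.2f}, area {uniform_area:.6f}")
print(f"gaussian peak  {gaussian_peak:.2f}, area {gaussian_area:.6f}")
```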

04 · Change of variables · Where the |det J| comes from

Transform Y = f(X) with f invertible and differentiable, and the naive guess p_Y(y) = p_X(f⁻¹(y)) is wrong: it forgets that f stretches and compresses space, and density is per unit length. The probability mass in a small interval must be conserved:

p_Y(y)·|dy| = p_X(x)·|dx|, with x = f⁻¹(y) (same mass, relabelled coordinates)
⇒ p_Y(y) = p_X(f⁻¹(y))·|dx/dy| in one dimension, and p_Y(y) = p_X(f⁻¹(y))·|det J_{f⁻¹}(y)| in ℝ^d

The Jacobian determinant is the local volume-change factor; dividing density by stretch keeps the area under the curve equal to 1. Normalising flows are this formula made into an architecture — stacks of invertible maps whose log|det J| is cheap to compute.
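A one-dimensional check makes the correction concrete. The sketch assumes X is standard normal and Y = exp(X), a transform chosen here only for illustration, so f⁻¹(y) = ln y and |dx/dy| = 1/y. The event 1 ≤ Y ≤ 2 is the same event as 0 ≤ X ≤ ln 2, so a correct p_Y must integrate to that probability over [1, 2].

```python
import math

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

def naive_pdf_y(y):
    # Wrong: relabels the point but ignores how exp stretches space.
    return std_normal_pdf(math.log(y))

def corrected_pdf_y(y):
    # Right: x = ln(y), so the stretch factor is |dx/dy| = 1 / y.
    return std_normal_pdf(math.log(y)) / y

def midpoint_integral(f, a, b, n=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Ground truth, computed entirely in x-space.
exact = std_normal_cdf(math.log(2.0)) - std_normal_cdf(0.0)

# The same probability, computed in y-space with each candidate density.
corrected = midpoint_integral(corrected_pdf_y, 1.0, 2.0)
naive = midpoint_integral(naive_pdf_y, 1.0, 2.0)

print(f"exact     P(0 <= X <= ln 2) = {exact:.6f}")
print(f"corrected integral over y   = {corrected:.6f}")
print(f"naive     integral over y   = {naive:.6f}")
```

Only the Jacobian-corrected density reproduces the x-space probability; the naive one overcounts because exp stretches this interval.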

05 · The ML reading · Likelihood: the same function, read sideways

A model density p(x | θ) is one object read two ways. Fix θ and vary x: it is a distribution over data, and it integrates to 1. Fix the observed data x and vary θ: it is the likelihood L(θ) = p(x | θ) — a score for parameters, which integrates to nothing in particular over θ and is not a distribution over θ at all.
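The two readings can be compared numerically. As an assumed example (not a model from this page), take the exponential density p(x | λ) = λ·exp(−λx) for x ≥ 0. Integrating over x with λ fixed should give 1; integrating over λ with x fixed gives 1/x², which is 4 when x = 0.5. The helper names below are illustrative.

```python
import math

def exp_density(x, lam):
    """Assumed model: p(x | lam) = lam * exp(-lam * x), for x >= 0."""
    return lam * math.exp(-lam * x)

def midpoint_integral(f, a, b, n=200_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Density reading: fix lam = 2, vary x. The upper limit 50 truncates a
# negligible tail, so this should come out at 1.
area_over_x = midpoint_integral(lambda x: exp_density(x, 2.0), 0.0, 50.0)

# Likelihood reading: fix x = 0.5, vary lam. Same formula, but the area
# over lam is 1 / x**2 = 4, not 1.
area_over_lam = midpoint_integral(lambda lam: exp_density(0.5, lam), 0.0, 100.0)

print(f"integral over x   (lam = 2.0): {area_over_x:.6f}")
print(f"integral over lam (x = 0.5):   {area_over_lam:.6f}")
```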

Density · x varies, θ fixed

"Given these parameters, how is data distributed?" Normalised over x. Used for sampling and evaluation.

Likelihood · θ varies, x fixed

"Given this data, how plausible is each parameter?" Not normalised over θ. Used for fitting — see MLE vs MAP.

Because individual continuous datapoints have probability zero, "the probability of the data" always silently means the density evaluated at the data — which is why log-likelihoods of continuous models can be positive (densities above 1), a fact that confuses everyone exactly once.
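A positive log-likelihood is easy to produce. The sketch below scores three made-up observations under a narrow Gaussian (σ = 0.01) and under a wide one (σ = 10); the data values and names are hypothetical and exist only to show the sign flip.

```python
import math

def gaussian_log_pdf(x, mu, sigma):
    """Log of the Gaussian density at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)

data = [0.001, -0.004, 0.007]  # hypothetical observations near zero

# Under the narrow model, every density value is well above 1 ...
densities = [math.exp(gaussian_log_pdf(x, 0.0, 0.01)) for x in data]

# ... so the log-likelihood (a sum of log-densities) is positive.
log_lik_narrow = sum(gaussian_log_pdf(x, 0.0, 0.01) for x in data)

# Under the wide model, densities are below 1 and the sum is negative.
log_lik_wide = sum(gaussian_log_pdf(x, 0.0, 10.0) for x in data)

print(f"narrow model log-likelihood: {log_lik_narrow:.3f}")
print(f"wide model   log-likelihood: {log_lik_wide:.3f}")
```

Neither sign is "wrong": the values are log-densities, so only comparisons between models on the same data, in the same units, carry meaning.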

Mental Model

PMF: heights are probabilities. PDF: heights are rates; only integrals are probabilities.
P(X = x) = 0 for every continuous X and every point x: points have no mass, intervals do.
Densities can exceed 1; the only law is total area = 1.
Under a change of variables, |det J| is the stretch factor that keeps mass conserved.
Likelihood is a density read as a function of θ — same formula, different variable held fixed.