Probability mass, probability density, and the trap between them
For a discrete variable, probability is straightforward bookkeeping. The probability mass function assigns each outcome a weight directly:
    p(x) = P(X = x),   with p(x) ≥ 0 and Σ_x p(x) = 1
Every number you read off a PMF is a genuine probability. For a fair die, p(3) = 1/6, full stop. Nothing here will surprise you, which is exactly why the continuous case does.
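As a minimal sketch of that bookkeeping, a PMF can be stored as a plain lookup table. The names `die_pmf` and `prob` are illustrative choices, and the die is assumed to be fair.

```python
from fractions import Fraction

# PMF of a fair six-sided die: each face maps directly to its probability.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def prob(event, pmf):
    """Probability of an event (a set of outcomes): add up the masses."""
    return sum(pmf[outcome] for outcome in event if outcome in pmf)

print(die_pmf[3])                # probability of rolling a 3
print(prob({2, 4, 6}, die_pmf))  # probability of an even roll
print(sum(die_pmf.values()))     # total mass over all outcomes
```

Every value in the table is already a probability, and events are handled by summation alone.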
Now let X be a person's height. What is the probability that someone is exactly 170 cm, to infinitely many decimal places? Zero. Any range of plausible heights contains uncountably many real numbers; if each of those points carried positive probability, the total would blow past 1. So for every continuous X and every point x:
    P(X = x) = 0
The fix is to stop asking about points and start asking about intervals. The probability density function p(x) is defined so that probability is its integral:
    P(a ≤ X ≤ b) = ∫_a^b p(x) dx
A density is probability per unit length — the same relationship a material's density has to its mass. Asking "what is the probability at this point" is asking "what is the mass of this point": zero. Only the integral over a region means anything. The map analogy holds up: population density at one GPS coordinate is well defined, but the population of a coordinate is zero; you must integrate over an area to count people.
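To see the "per unit length" reading numerically, here is a small sketch that shrinks an interval around 170 cm. It assumes, purely for illustration, that heights follow a normal distribution with mean 170 cm and standard deviation 7 cm; those numbers and the function names are made up for the example.

```python
import math

MU, SIGMA = 170.0, 7.0  # assumed height distribution in cm, illustration only

def normal_pdf(x, mu=MU, sigma=SIGMA):
    """Density of the assumed normal distribution (units: per cm)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu=MU, sigma=SIGMA):
    """P(X <= x), computed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def prob_interval(a, b):
    """P(a <= X <= b): the integral of the density from a to b."""
    return normal_cdf(b) - normal_cdf(a)

print(f"density at 170 cm: {normal_pdf(170.0):.5f} per cm")
for half_width in (5.0, 0.5, 0.05, 0.005):
    p = prob_interval(170.0 - half_width, 170.0 + half_width)
    per_cm = p / (2.0 * half_width)
    print(f"width {2 * half_width:>6} cm: P = {p:.6f}, P / width = {per_cm:.5f}")
print("P(X == 170 exactly) =", prob_interval(170.0, 170.0))
```

As the interval narrows, the probability itself heads to zero while probability divided by width settles at the density value. That limit is what p(170) actually reports.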
Because p(x) is a rate, not a probability, nothing caps it at 1. The uniform density on [0, 0.1] equals 10 everywhere on its support; a Gaussian with σ = 0.01 peaks near 40. The constraint is only that the area equals 1 — tall is fine if narrow.
Figure: Left (PMF): each bar height is a probability. Right (PDF): heights are rates; the shaded area is the probability.
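The two tall densities mentioned above can be checked directly. The sketch below evaluates each one and approximates its area with a simple midpoint rule; the helper `integrate` is a hypothetical convenience written for this example, not a library routine.

```python
import math

def uniform_pdf(x, lo=0.0, hi=0.1):
    """Uniform density on [lo, hi]: constant height 1 / (hi - lo)."""
    return 1.0 / (hi - lo) if lo <= x <= hi else 0.0

def gaussian_pdf(x, mu=0.0, sigma=0.01):
    """Gaussian density with a very small standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

print("uniform height: ", uniform_pdf(0.05))
print("gaussian peak:  ", gaussian_pdf(0.0))
print("uniform area:   ", integrate(uniform_pdf, 0.0, 0.1))
print("gaussian area:  ", integrate(gaussian_pdf, -0.1, 0.1))  # +/- 10 sigma
```

Both heights are far above 1, yet both areas come out at 1 up to numerical error, which is the only constraint a density has to satisfy.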
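Transform Y = f(X) with f invertible and differentiable, and the naive guess p_Y(y) = p_X(f^{-1}(y)) is wrong: it forgets that f stretches and compresses space, and density is per unit length. Probability mass in a small interval must be conserved:
    p_Y(y) |dy| = p_X(x) |dx|,   so   p_Y(y) = p_X(f^{-1}(y)) / |det J_f(f^{-1}(y))|
Here J_f is the Jacobian of f; in one dimension its determinant is just the derivative f′(x).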
The Jacobian determinant is the local volume-change factor; dividing density by stretch keeps the area under the curve equal to 1. Normalising flows are this formula made into an architecture — stacks of invertible maps whose log|det J| is cheap to compute.
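A one-dimensional sketch makes the correction concrete. The setup is an assumption chosen for illustration: X is normal with mean 0 and standard deviation 0.5, and Y = exp(X), so the inverse is log(y) and the stretch factor is f′(x) = exp(x) = y. The helper names are hypothetical.

```python
import math

SIGMA = 0.5  # assumed for illustration: X ~ Normal(0, SIGMA^2), Y = exp(X)

def p_x(x):
    """Density of X."""
    return math.exp(-0.5 * (x / SIGMA) ** 2) / (SIGMA * math.sqrt(2.0 * math.pi))

def naive_p_y(y):
    """Wrong: plugs the inverse log(y) into p_X and ignores the stretch."""
    return p_x(math.log(y))

def correct_p_y(y):
    """p_X at the inverse, divided by the stretch f'(x) = exp(x) = y."""
    return p_x(math.log(y)) / y

def integrate(f, a, b, n=200_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Integrate over the image of x in [-8 sigma, +8 sigma], which holds
# essentially all of the probability.
lo, hi = math.exp(-8 * SIGMA), math.exp(8 * SIGMA)
print("naive area:  ", integrate(naive_p_y, lo, hi))
print("correct area:", integrate(correct_p_y, lo, hi))
```

For this particular choice the naive version integrates to about exp(SIGMA² / 2) ≈ 1.13, so it is not a density at all, while the corrected version integrates to 1. A normalising flow applies the same division at every layer, usually in log space.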
A model density p(x | θ) is one object read two ways. Fix θ and vary x: it is a distribution over data, and it integrates to 1. Fix the observed data x and vary θ: it is the likelihood L(θ) = p(x | θ) — a score for parameters, which integrates to nothing in particular over θ and is not a distribution over θ at all.
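The two readings can be compared numerically. The sketch below uses an exponential model p(x | rate) = rate · exp(−rate · x) as a stand-in; the model, the observed value 2.0, and the helper names are assumptions made for the example.

```python
import math

def density(x, rate):
    """Exponential model p(x | rate) = rate * exp(-rate * x), for x >= 0."""
    return rate * math.exp(-rate * x)

def integrate(f, a, b, n=200_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Reading 1: fix the parameter, vary the data. A distribution over x.
area_over_x = integrate(lambda x: density(x, rate=1.5), 0.0, 40.0)

# Reading 2: fix the data, vary the parameter. The likelihood L(rate).
x_obs = 2.0
area_over_rate = integrate(lambda r: density(x_obs, r), 0.0, 40.0)

print("integral over x at rate = 1.5: ", area_over_x)
print("integral over rate at x = 2.0: ", area_over_rate)
```

The first integral is 1, as it must be. The second works out to 1 / x_obs² = 0.25 for this model, a number with no probabilistic meaning, which is the sense in which the likelihood is not a distribution over the parameter.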
Because individual continuous datapoints have probability zero, "the probability of the data" always silently means the density evaluated at the data — which is why log-likelihoods of continuous models can be positive (densities above 1), a fact that confuses everyone exactly once.
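A short sketch of the positive log-likelihood effect, using a Gaussian model and five made-up measurements. The data values and both parameter settings are invented for illustration.

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Log of the Gaussian density at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

data = [0.502, 0.497, 0.511, 0.493, 0.500]  # made-up measurements

# A tight model: the density near the data is well above 1.
tight = sum(gaussian_logpdf(x, mu=0.5, sigma=0.01) for x in data)

# A wide model: the density is below 1 everywhere.
wide = sum(gaussian_logpdf(x, mu=0.5, sigma=1.0) for x in data)

print("log-likelihood, sigma = 0.01:", tight)
print("log-likelihood, sigma = 1.0: ", wide)
```

The tight model gives a positive total because each term is the log of a density greater than 1. Nothing is broken; the sign of a continuous log-likelihood depends on the units and scale of the data, so only differences between models fitted to the same data are meaningful.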