
MLE vs MAP

Trust the data alone, or hedge with a prior — and why regularisers are priors

01 · MLE: Pick the θ that makes the data least surprising

You have data D and a model family p(D | θ). The maximum-likelihood principle is the most natural rule available: choose the parameters under which what you actually observed was most probable.

θ̂_MLE = argmax_θ p(D | θ) = argmax_θ Σ_i log p(x_i | θ)

We take the log for two unglamorous reasons: independent likelihoods are products, and logs turn products into sums (differentiable term by term, parallelisable, the form every gradient framework expects); and a product of a million numbers below 1 underflows any float, while its log is a perfectly tame sum. The argmax is unchanged because log is monotone.
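
The underflow point is easy to reproduce. The sketch below assumes a made-up dataset of one million observations that each have likelihood 0.5 under some fixed θ; the numbers are illustrative only.

```python
import math

# Hypothetical data: one million i.i.d. observations, each with
# likelihood 0.5 under a fixed θ (value chosen only for illustration).
likelihoods = [0.5] * 1_000_000

# Naive product: falls below the smallest representable double and becomes 0.0.
product = math.prod(likelihoods)

# Sum of logs: the same quantity on a log scale stays finite.
log_likelihood = math.fsum(math.log(p) for p in likelihoods)

print(product)         # 0.0
print(log_likelihood)  # roughly -693147.18
```

Once the product is 0.0 there is nothing left for an optimiser to compare, whereas the log-likelihood still ranks candidate values of θ.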

MLE has excellent asymptotic manners: under standard regularity conditions it is consistent and asymptotically efficient, and it is equivalent to minimising KL(data ∥ model), which is why cross-entropy training is MLE. Its failure mode is not asymptotic.
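
One way to see the KL claim, sketched for discrete data. Here p̂_data denotes the empirical distribution of the n observations and H its entropy; this notation is an assumption introduced for the sketch, not the source's.

```latex
\begin{aligned}
\mathrm{KL}\left(\hat{p}_{\text{data}} \,\|\, p_\theta\right)
  &= \sum_x \hat{p}_{\text{data}}(x) \log \hat{p}_{\text{data}}(x)
   - \sum_x \hat{p}_{\text{data}}(x) \log p(x \mid \theta) \\
  &= \underbrace{-H\left(\hat{p}_{\text{data}}\right)}_{\text{independent of } \theta}
   \;-\; \frac{1}{n} \sum_{i=1}^{n} \log p(x_i \mid \theta)
\end{aligned}
```

Only the second term depends on θ, and it is the average negative log-likelihood (the cross-entropy). Minimising the KL divergence over θ therefore selects the same θ as maximising the log-likelihood.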

02 · The breakage: Three heads, and the coin is "certainly" rigged

Flip a coin three times, observe HHH. With θ the probability of heads, the likelihood is θ³, maximised at:

θ̂_MLE = 3/3 = 1   (tails is impossible, says the model, after three flips)

MLE believes thin data with total conviction. It assigns probability zero to anything unseen (the smoothing problem in language models is this exact failure), and its variance explodes precisely when data is scarce — the regime where you most need an estimator you can trust. Nothing in the machinery represents the thought "three flips is not much evidence".
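
A minimal sketch of the failure, assuming the usual relative-frequency form of the MLE for a categorical distribution. The function name `mle_categorical` is hypothetical.

```python
import math
from collections import Counter

def mle_categorical(observations, outcomes):
    """Maximum-likelihood estimate of a categorical distribution:
    each outcome's probability is its relative frequency."""
    counts = Counter(observations)
    n = len(observations)
    return {o: counts[o] / n for o in outcomes}

flips = ["H", "H", "H"]
estimate = mle_categorical(flips, outcomes=["H", "T"])
print(estimate)  # {'H': 1.0, 'T': 0.0}

# The plug-in standard error sqrt(p(1-p)/n) is also zero, so the
# estimator reports certainty rather than "three flips is not much".
p = estimate["H"]
plug_in_se = math.sqrt(p * (1 - p) / len(flips))
print(plug_in_se)  # 0.0
```

The unseen outcome gets probability exactly zero, which is the same mechanism behind zero-probability n-grams in unsmoothed language models.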

03 · MAP: Let a prior vote too

Bring in Bayes: treat θ as uncertain, give it a prior p(θ), and maximise the posterior instead. Since the evidence does not depend on θ:

θ̂_MAP = argmax_θ p(θ | D) = argmax_θ p(D | θ)·p(θ)
       = argmax_θ [ log p(D | θ) + log p(θ) ]   (the MLE objective + log-prior)

One added term. For the coin, a mild Beta(2,2) prior ("coins are usually fair-ish") combined with three heads gives a Beta(5,2) posterior, whose mode is (5 − 1)/(5 + 2 − 2) = 4/5. The estimate moves from 1.0 to 0.8: still leaning heads, no longer certain. The prior is a rubber band anchored at your prior belief; the likelihood stretches it toward the data, and three flips do not stretch it far.

[Figure: prior Beta(2,2), likelihood θ³, and posterior plotted against θ = P(heads) on [0, 1]; the MLE sits at 1.0 and the MAP at 0.8.]

Three heads: the likelihood (red) peaks at the boundary; the prior (green) pulls the posterior (blue) back inside.
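
The coin numbers can be checked two ways: with the closed-form mode of a Beta posterior, and by brute-force maximisation of log-likelihood plus log-prior on a grid. This is a sketch; the helper names `map_bernoulli` and `log_posterior` are hypothetical.

```python
import math

def map_bernoulli(heads, tails, alpha, beta):
    """Posterior mode of a heads-probability under a Beta(alpha, beta) prior.

    The posterior is Beta(alpha + heads, beta + tails); its mode is
    (a - 1) / (a + b - 2) when both posterior parameters exceed 1.
    """
    a = alpha + heads
    b = beta + tails
    if a <= 1 or b <= 1:
        raise ValueError("posterior mode is not interior; formula does not apply")
    return (a - 1) / (a + b - 2)

theta_map = map_bernoulli(heads=3, tails=0, alpha=2, beta=2)
print(theta_map)  # 0.8

def log_posterior(theta):
    # log-likelihood of HHH plus the log of the Beta(2,2) density,
    # dropping constants that do not depend on theta.
    return 3 * math.log(theta) + math.log(theta) + math.log(1 - theta)

# Grid search over the open interval (0, 1) as an independent check.
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=log_posterior)
print(theta_grid)  # 0.8
```

The grid version is the literal "MLE objective + log-prior" recipe; the closed form only exists because the Beta prior is conjugate to the coin-flip likelihood.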

04 · The punchline: Your regulariser is a prior in disguise

Take MAP = MLE + log-prior and plug in a Gaussian prior over weights, p(w) = N(0, σ²I):

log p(w) = −‖w‖²/(2σ²) + const
⇒ ŵ_MAP = argmin_w [ NLL(w) + λ‖w‖² ],   λ = 1/(2σ²)   (L2 / ridge / weight decay)

A Laplace prior, p(w) ∝ exp(−‖w‖₁/b), gives log p(w) = −‖w‖₁/b + const, which is L1 / lasso with λ = 1/b; its sharp peak at zero is why it produces exact sparsity. So the regularisation hyperparameter you tune by grid search encodes the width of a belief, inversely: small λ = a loose prior ("weights may be anything"), large λ = a tight one ("weights are almost surely small"). Weight decay is not a hack bolted onto the loss; it is Bayesian inference with the posterior collapsed to its peak.
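
To make the λ = 1/(2σ²) correspondence concrete, here is a sketch for the simplest possible case: a single weight w in y = w·x + Gaussian noise. The toy data, the noise variance of 1, and the function name `ridge_map_1d` are all assumptions made for illustration.

```python
def ridge_map_1d(xs, ys, prior_var, noise_var=1.0):
    """MAP weight for y = w*x + Gaussian noise under a N(0, prior_var) prior.

    Minimises NLL(w) + lam * w**2 with lam = 1 / (2 * prior_var), where
    NLL(w) = sum((y - w*x)**2) / (2 * noise_var) up to a constant.
    """
    lam = 1.0 / (2.0 * prior_var)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Setting the derivative of the objective to zero gives this closed form.
    return sxy / (sxx + 2.0 * lam * noise_var)

# Hypothetical toy data, made up for illustration.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

# MLE is the unregularised least-squares slope.
w_mle = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
w_loose = ridge_map_1d(xs, ys, prior_var=100.0)  # small lambda, loose prior
w_tight = ridge_map_1d(xs, ys, prior_var=0.01)   # large lambda, tight prior

print(w_mle, w_loose, w_tight)
```

With the loose prior the MAP weight is almost the MLE; with the tight prior it is pulled strongly toward zero, which is the same behaviour as turning up weight decay.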

05 · The reconciliation: The likelihood eventually wins

In the MAP objective, the log-likelihood is a sum of n terms and the log-prior is one term. As n grows, the sum grows linearly and the prior stays put, so its relative weight decays like 1/n:

Σ_{i=1}^{n} log p(x_i | θ)  +  log p(θ)   ⟹   θ̂_MAP → θ̂_MLE  as n → ∞

| Regime | What dominates | Practical reading |
|---|---|---|
| Small n | Prior | Regularisation matters enormously; MLE is dangerous. |
| Large n | Likelihood | MLE and MAP nearly coincide; argue about priors less. |
| Any n, full posterior wanted | Neither point estimate | MAP keeps one point and discards uncertainty; Bayesian inference proper keeps the whole distribution. |

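
The 1/n decay of the prior's influence can be seen directly on the coin. The sketch below assumes a hypothetical coin whose observed heads fraction is held at exactly 0.7 for every sample size, and reuses the Beta(2,2) prior from earlier.

```python
def mle(heads, n):
    # Relative frequency of heads.
    return heads / n

def map_beta22(heads, n):
    # Posterior mode under a Beta(2, 2) prior: (heads + 1) / (n + 2).
    return (heads + 1) / (n + 2)

sample_sizes = [10, 100, 1000, 10000]
gaps = []
for n in sample_sizes:
    heads = 7 * n // 10  # hypothetical: heads fraction fixed at 0.7
    gap = abs(mle(heads, n) - map_beta22(heads, n))
    gaps.append(gap)
    print(n, mle(heads, n), map_beta22(heads, n), gap)
```

For this setup the gap works out algebraically to 0.4/(n + 2), so each tenfold increase in data shrinks the disagreement between MLE and MAP by roughly a factor of ten.
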
Mental Model