Buying bias to cut variance, in four currencies
A flexible model trained on a finite sample will spend part of its capacity fitting the noise in that particular sample: this is the variance term of the bias–variance decomposition. Regularisation is any device that restrains how freely the model can chase its training set. Every method in this note, however different it looks, makes the same trade: accept a little systematic error (bias) to suppress sensitivity to the sample (variance). No entry in the taxonomy escapes paying; the craft is paying in the currency your problem feels least.
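Penalty: add a term to the loss that charges for parameter magnitude, L(θ) + λΩ(θ), where λ sets the price. The two classic penalties, L2 (Ω(θ) = ‖θ‖₂², the squared Euclidean norm) and L1 (Ω(θ) = ‖θ‖₁, the sum of absolute values), behave very differently at zero.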
L2's pull is proportional to the weight, so it shrinks everything smoothly toward zero but never reaches it — small, distributed weights, smoother functions. L1's pull is constant regardless of size, so small weights get dragged exactly to zero and stay there: sparsity, free feature selection. The Bayesian reading makes the bias explicit: the penalty is a prior, and the regularised solution is the MAP estimate. L2 says "I believe weights are small" (Gaussian prior); L1 says "I believe most weights are exactly irrelevant" (Laplace prior). You are injecting a belief; that belief is the bias you bought. (For the optimiser-interaction fine print — L2 inside Adam is not weight decay — see AdamW.)
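The difference at zero can be seen by applying each penalty on its own, with no data term, to a handful of weights. The sketch below is illustrative only: the weights, `lam`, `lr` and the step count are made-up values, and the L1 update uses a proximal (soft-threshold) step, which is an assumption swapped in here because a plain subgradient step tends to oscillate around zero rather than land on it.

```python
import numpy as np

def l2_step(w, lam, lr):
    """One gradient step on (lam / 2) * ||w||^2 alone: the pull is lam * w."""
    return w - lr * lam * w

def l1_prox_step(w, lam, lr):
    """One soft-threshold step on lam * ||w||_1 alone: the pull is a constant lr * lam."""
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

# Hypothetical starting weights: one large, one medium, two tiny.
w0 = np.array([2.0, 0.5, -0.05, 0.01])
lam, lr, steps = 1.0, 0.1, 10

w_l2, w_l1 = w0.copy(), w0.copy()
for _ in range(steps):
    w_l2 = l2_step(w_l2, lam, lr)
    w_l1 = l1_prox_step(w_l1, lam, lr)

print("L2:", w_l2)  # every weight scaled by the same factor, none exactly zero
print("L1:", w_l1)  # small weights clipped to exactly zero, the large one reduced
```

Under L2 each weight is multiplied by the same factor per step, so the ordering and relative sizes survive. Under L1 every weight loses the same absolute amount per step, so anything smaller than the accumulated charge is removed outright.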
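Noise: the second family corrupts the training signal so that fitting any one sample's quirks stops paying. Dropout trains a random subnetwork at each step, which acts as an implicit ensemble and forbids co-adapted features; data augmentation trains on label-preserving transforms, buying exactly the invariances you declare; label smoothing softens one-hot targets, capping the model's confidence.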
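Two members of this family, dropout and label smoothing (both listed in the summary table), are small enough to sketch directly. This is a minimal NumPy illustration, not a framework implementation: the function names `dropout` and `smooth_labels`, the drop probability, and the smoothing strength `eps` are all assumptions chosen for the example. The dropout shown is the common "inverted" variant, which rescales surviving units so the expected activation is unchanged.

```python
import numpy as np

def dropout(x, p_drop, rng):
    """Inverted dropout: zero each unit with probability p_drop, rescale the rest."""
    keep = 1.0 - p_drop
    mask = rng.random(x.shape) < keep  # True where the unit survives
    return x * mask / keep

def smooth_labels(y, num_classes, eps):
    """Blend one-hot targets with the uniform distribution over classes."""
    one_hot = np.eye(num_classes)[y]
    return (1.0 - eps) * one_hot + eps / num_classes

rng = np.random.default_rng(0)

# Hypothetical activations: all ones, so the effect of the mask is easy to read.
h = np.ones((4, 1000))
h_drop = dropout(h, p_drop=0.5, rng=rng)

# Hypothetical labels for a 3-class problem.
targets = smooth_labels(np.array([0, 2]), num_classes=3, eps=0.1)

print(h_drop.mean())  # close to 1.0: the corruption is unbiased on average
print(targets)        # no target is exactly 0 or 1 any more
```

In both cases the training signal is perturbed while its average stays meaningful, which is what makes memorising one sample's exact activations or labels unprofitable.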
Architectural: constraints baked into the model family itself. Weight sharing in CNNs is the canonical case: declaring that the same filter applies at every spatial position collapses millions of free parameters into thousands, a hard prior of translation equivariance (which pooling turns into approximate invariance). Hard priors are the strongest regularisers available, and the most biased: they are unbeatable when true (images) and crippling when false.
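The size of that collapse is easy to check with arithmetic. The sketch below compares a fully connected layer with a convolution that produces an output of the same shape; the 32×32×3 input, 64 output channels and 3×3 kernel are hypothetical sizes chosen for illustration, and the helper names are not from any library.

```python
def dense_params(height, width, c_in, c_out):
    """Weights plus biases for a fully connected map between two feature maps."""
    n_in = height * width * c_in
    n_out = height * width * c_out
    return n_in * n_out + n_out

def conv_params(kernel, c_in, c_out):
    """Weights plus biases for a kernel x kernel convolution shared across positions."""
    return kernel * kernel * c_in * c_out + c_out

# Hypothetical layer: 32x32 RGB input, 64 output channels, same spatial size.
dense = dense_params(32, 32, 3, 64)
conv = conv_params(3, 3, 64)

print(f"dense: {dense:,} parameters")
print(f"conv:  {conv:,} parameters")
print(f"ratio: {dense / conv:,.0f}x")
```

With these sizes the dense layer needs roughly 200 million parameters and the convolution fewer than two thousand. Everything the dense layer could express that the convolution cannot is exactly the set of functions that treat different positions differently.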
Implicit: regularisation nobody wrote down. Early stopping halts the optimiser before it can travel far enough from initialisation to fit the noise; for linear least squares trained by gradient descent it is approximately equivalent to an L2 penalty whose strength falls as training time grows. The noise in SGD is widely argued to bias training toward flat minima, which tolerate the shift between training data and reality. Much of deep learning's generalisation is attributed to this unbilled category, which is part of why heavily overparameterised nets defy the naive capacity story.
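The link between early stopping and L2 can be made concrete for linear least squares. Along each singular direction of the data matrix, gradient descent started at zero and stopped after `t` steps reaches a fraction `1 - (1 - lr * s**2)**t` of the unregularised solution, while ridge regression reaches a fraction `s**2 / (s**2 + lam)`. The sketch below checks the first formula numerically and prints both sets of fractions side by side. The data are synthetic, and the pairing `lam = 1 / (lr * t)` is a common rule of thumb used here as an assumption, not an exact equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                       # synthetic design matrix
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=50)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_ols = Vt.T @ ((U.T @ y) / s)                     # unregularised least-squares solution

lr = 1.0 / s.max() ** 2                            # small enough for stable descent
t = 10                                             # number of steps before stopping

# Gradient descent on 0.5 * ||Xw - y||^2, started at zero, stopped after t steps.
w = np.zeros(5)
for _ in range(t):
    w -= lr * X.T @ (X @ w - y)

# Fraction of the least-squares solution recovered along each singular direction.
early_stop_factor = 1.0 - (1.0 - lr * s ** 2) ** t
lam = 1.0 / (lr * t)                               # heuristic pairing (assumption)
ridge_factor = s ** 2 / (s ** 2 + lam)

# Rebuild the early-stopped weights from the closed-form factors.
w_from_factors = Vt.T @ (early_stop_factor * (U.T @ y) / s)

print("matches gradient descent:", np.allclose(w, w_from_factors))
print("early stopping factors:  ", np.round(early_stop_factor, 3))
print("ridge factors:           ", np.round(ridge_factor, 3))
```

Both filters leave high-variance directions (large `s`) nearly untouched and shrink low-variance directions hardest, which is the sense in which stopping early behaves like an L2 penalty. Running longer raises `t`, lowers the matched `lam`, and weakens the shrinkage.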
| Family | Method | Mechanism | The bias you buy |
|---|---|---|---|
| Penalty | L2 / weight decay | shrink all weights (Gaussian prior) | smoother functions, small weights |
| Penalty | L1 | drive weights to exact zero (Laplace prior) | sparsity: most features assumed irrelevant |
| Noise | dropout | random subnetworks → implicit ensemble | no co-adapted features allowed |
| Noise | data augmentation | train on label-preserving transforms | the declared invariances |
| Noise | label smoothing | soften one-hot targets | capped confidence |
| Architectural | weight sharing (CNNs) | same filter everywhere — hard prior | translation invariance, true or not |
| Implicit | early stopping | bound distance from init (≈ L2) | solutions near the start preferred |
| Implicit | SGD noise | kicked out of sharp minima | flat-minima preference |