General ML

Regularisation Methods

Buying bias to cut variance, in four currencies

01 · First principles: One purpose, many disguises

A flexible model trained on a finite sample will spend part of its capacity fitting the noise in that particular sample — the variance term of the bias–variance decomposition. Regularisation is any device that restrains how freely the model can chase its training set. Every method in this note, however different it looks, does the same transaction: accept a little systematic error (bias) to suppress sensitivity to the sample (variance). There is no entry in the taxonomy that escapes paying; the craft is paying in the currency your problem feels least.
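
The transaction can be watched directly in a small simulation. What follows is a minimal sketch, not a benchmark: it assumes NumPy, a made-up linear ground truth, fixed training inputs, and label noise that is redrawn on every trial, and it uses ridge regression as the stand-in regulariser. The names `ridge_fit` and `bias_variance` and all the sizes are illustrative choices, not anything prescribed by this note.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def bias_variance(lam, n_trials=500, n=20, d=10, noise=1.0, seed=0):
    """Estimate squared bias and variance of ridge predictions.

    Assumption: training inputs are fixed; only the label noise is
    resampled, so each trial is a different noisy sample of one problem.
    """
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=d)          # hypothetical ground-truth weights
    X = rng.normal(size=(n, d))          # fixed training inputs
    X_test = rng.normal(size=(50, d))    # fixed test inputs
    f_test = X_test @ w_true             # noiseless test targets
    preds = np.empty((n_trials, X_test.shape[0]))
    for t in range(n_trials):
        y = X @ w_true + noise * rng.normal(size=n)   # fresh noisy labels
        preds[t] = X_test @ ridge_fit(X, y, lam)
    bias_sq = np.mean((preds.mean(axis=0) - f_test) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

for lam in (0.0, 1.0, 10.0):
    b, v = bias_variance(lam)
    print(f"lambda={lam:5.1f}  bias^2={b:.3f}  variance={v:.3f}")
```

Under these assumptions, raising `lam` should push the bias column up and the variance column down, which is the trade the rest of this note keeps re-deriving in other currencies.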

02 · Penalties: Charge the weights rent

Add a term to the loss that charges for parameter magnitude: L(θ) + λΩ(θ). The two classic rents behave very differently at zero.

L2: Ω = ½‖θ‖²₂ → gradient λθ → proportional shrinkage | L1: Ω = ‖θ‖₁ → subgradient λ·sign(θ) → exact zeros

L2's pull is proportional to the weight, so it shrinks everything smoothly toward zero but never reaches it: small, distributed weights and smoother functions. L1's pull has constant magnitude regardless of the weight's size, so any weight whose loss gradient is smaller than λ is held at exactly zero at the optimum: sparsity, and feature selection for free. (Plain subgradient steps only hover around zero; proximal or coordinate-descent solvers land on it exactly.) The Bayesian reading makes the bias explicit: the penalty is a negative log-prior, and the regularised solution is the MAP estimate. L2 says "I believe weights are small" (Gaussian prior); L1 says "I believe most weights are exactly irrelevant" (Laplace prior). You are injecting a belief; that belief is the bias you bought. (For the optimiser-interaction fine print, namely that L2 inside Adam is not weight decay, see AdamW.)
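
The two pulls can be compared on the penalty alone, with the data loss left out. This is a minimal sketch assuming NumPy and made-up values for the weights, λ and the step size. One substitution to flag: the L1 update below is the proximal (soft-thresholding) step rather than a raw λ·sign(θ) subgradient step, because it is the proximal form that lands on exact zeros. The function names are illustrative.

```python
import numpy as np

def l2_step(theta, lam, lr):
    """Gradient step on the L2 penalty alone (gradient lam * theta):
    every weight is multiplied by (1 - lr * lam)."""
    return theta - lr * lam * theta

def l1_prox_step(theta, lam, lr):
    """Proximal step for lam * ||theta||_1 (soft-thresholding):
    subtract a constant lr * lam from each magnitude, clipping at zero."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lr * lam, 0.0)

theta = np.array([3.0, 0.5, 0.05, -0.03])   # hypothetical starting weights
t2, t1 = theta.copy(), theta.copy()
for _ in range(20):
    t2 = l2_step(t2, lam=1.0, lr=0.1)
    t1 = l1_prox_step(t1, lam=1.0, lr=0.1)

print("L2:", t2)   # every entry scaled by 0.9**20, none exactly zero
print("L1:", t1)   # only the largest weight survives; the rest are zero
```

The L2 column keeps all four weights alive at roughly 12% of their size, while the L1 column charges each weight the same flat fee per step and bankrupts the three small ones.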

03 · Noise: Make memorisation a losing game

The second family corrupts the training signal so that fitting any one sample's quirks stops paying.

Dropout randomly silences each unit with probability p at each step, so no unit can rely on a specific co-conspirator existing; features must be individually useful. The cleaner reading: each step trains a random subnetwork, and the test-time model (weights scaled by the keep probability 1 − p) approximates an ensemble average over exponentially many weight-sharing subnetworks. That is implicit ensembling, and averaging an ensemble is a variance-reduction device by construction.
Data augmentation injects the noise into inputs: crops, flips, colour jitter, noise. It is the most honest regulariser because the bias it buys is a stated invariance ("labels do not change under horizontal flip") — when the invariance is true, the bias costs nothing and you have manufactured data.
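Label smoothing softens one-hot targets to 1 − α + α/K on the true class and α/K elsewhere, where K is the number of classes, so the targets still sum to 1. The model is no longer rewarded for driving logits to infinity in pursuit of probability 1, which tempers overconfidence on noisy labels.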
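
Two of these are small enough to write out. The sketch below assumes NumPy and uses illustrative function names. It implements classic dropout (mask at training time, scale by the keep probability 1 − p at test time) on activations, which matches scaling the outgoing weights when the next layer is linear. Many frameworks instead rescale during training ("inverted" dropout); that variant is not shown. The label-smoothing helper mixes the one-hot target with a uniform distribution over K classes, which is one common convention.

```python
import numpy as np

def dropout_train(h, p, rng):
    """Classic dropout: zero each unit independently with probability p."""
    mask = rng.random(h.shape) >= p
    return h * mask

def dropout_test(h, p):
    """Test-time counterpart: scale by the keep probability 1 - p."""
    return h * (1.0 - p)

def smooth_labels(y, num_classes, alpha):
    """Mix one-hot targets with the uniform distribution over classes."""
    one_hot = np.eye(num_classes)[y]
    return (1.0 - alpha) * one_hot + alpha / num_classes

rng = np.random.default_rng(0)
h = np.array([1.0, 2.0, 3.0, 4.0])          # hypothetical activations
# Averaging many random subnetworks approaches the scaled test-time output.
avg = np.mean([dropout_train(h, 0.5, rng) for _ in range(20000)], axis=0)
print(avg, dropout_test(h, 0.5))

# With K = 4 and alpha = 0.1: true class gets 0.925, the others 0.025.
print(smooth_labels(np.array([2]), num_classes=4, alpha=0.1))
```

For a single linear readout the average over masks and the scaled test-time output agree in expectation; through nonlinear layers the scaling is only an approximation to the ensemble average.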

04 · Architecture & accident: Built-in and implicit regularisers

Architectural: constraints baked into the model family itself. Weight sharing in CNNs is the canonical case: declaring that the same filter applies at every spatial position collapses millions of free parameters into thousands, a hard prior of translation invariance (strictly, convolution gives equivariance: shift the input and the feature map shifts with it; pooling turns that into approximate invariance). Hard priors are the strongest regularisers available, and the most biased: they are unbeatable when true (images) and crippling when false.
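
The parameter arithmetic and the symmetry it buys can both be checked in a few lines. This is a sketch under stated assumptions: NumPy, a hypothetical layer mapping a 32×32 RGB image to 16 feature maps of the same spatial size, and a toy one-dimensional filter with circular padding so that shifts wrap around cleanly. Strictly, what the shared filter guarantees is equivariance (shift in, shift out), which is what the second half verifies.

```python
import numpy as np

def dense_params(n_in, n_out):
    """Fully connected layer: one weight per input-output pair, plus biases."""
    return n_in * n_out + n_out

def conv_params(k, c_in, c_out):
    """k x k convolution: one shared filter per (c_in, c_out) pair, plus biases."""
    return k * k * c_in * c_out + c_out

# Hypothetical layer: 32x32x3 input -> 32x32x16 output.
dense = dense_params(32 * 32 * 3, 32 * 32 * 16)
conv = conv_params(3, 3, 16)
print(dense, conv)   # 50348032 vs 448

def circular_conv1d(x, w):
    """Apply the same filter w at every position of x (circular padding)."""
    n = len(x)
    return np.array([sum(w[j] * x[(i + j) % n] for j in range(len(w)))
                     for i in range(n)])

rng = np.random.default_rng(0)
x, w = rng.normal(size=8), rng.normal(size=3)
shifted_then_filtered = circular_conv1d(np.roll(x, 2), w)
filtered_then_shifted = np.roll(circular_conv1d(x, w), 2)
print(np.allclose(shifted_then_filtered, filtered_then_shifted))   # True
```

The dense layer is free to learn a different detector for every pixel; the convolution is not, and that refusal is the regulariser.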

Implicit: regularisation nobody wrote down. Early stopping halts the optimiser before it can travel far enough from initialisation to fit the noise; for linear least squares trained by gradient descent it behaves approximately like an L2 penalty with strength λ ∝ 1/(ηt), for learning rate η and step count t. The noise in SGD itself is widely thought to bias training toward flat minima, which tolerate the shift between training data and reality, although how much flatness explains generalisation is still debated. A substantial share of deep learning's generalisation is attributed to this unbilled category, which is part of why heavily overparameterised nets defy the naive capacity story.
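
The early-stopping claim can be made concrete for linear least squares. The sketch below assumes NumPy, synthetic data, gradient descent from a zero initialisation with step size η = 1/s_max (s_max being the largest eigenvalue of XᵀX), and the heuristic matching λ = 1/(ηt); none of these choices come from the note itself. Along each eigendirection of XᵀX, stopping after t steps multiplies the least-squares solution by 1 − (1 − ηs)ᵗ, while ridge multiplies it by s/(s + λ).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                       # synthetic design matrix
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=30)

s = np.linalg.eigvalsh(X.T @ X)                    # eigenvalues of X^T X
lr = 1.0 / s.max()                                 # keeps every lr * s in (0, 1]

def gd_least_squares(X, y, lr, steps):
    """Gradient descent on 0.5 * ||Xw - y||^2 from zero, stopped after `steps`."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)
    return w

def ridge(X, y, lam):
    """Minimiser of 0.5 * ||Xw - y||^2 + 0.5 * lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for steps in (1, 5, 25, 125):
    lam = 1.0 / (lr * steps)                       # heuristic matching, an assumption
    gd_factor = 1.0 - (1.0 - lr * s) ** steps      # shrinkage from stopping early
    ridge_factor = s / (s + lam)                   # shrinkage from the L2 penalty
    w_gd = gd_least_squares(X, y, lr, steps)
    w_ridge = ridge(X, y, lam)
    print(f"t={steps:4d}  |w_gd|={np.linalg.norm(w_gd):.3f}  "
          f"|w_ridge|={np.linalg.norm(w_ridge):.3f}  "
          f"gd={np.round(gd_factor, 2)}  ridge={np.round(ridge_factor, 2)}")
```

In this setup the two sets of shrinkage factors track each other to within a factor of two in every direction, and the norm of the early-stopped solution grows with t, so stopping sooner plays the role of a larger λ. The match is approximate, not an identity.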

05 · The taxonomy: One table

Family	Method	Mechanism	The bias you buy
Penalty	L2 / weight decay	shrink all weights (Gaussian prior)	smoother functions, small weights
Penalty	L1	drive weights to exact zero (Laplace prior)	sparsity — most features assumed irrelevant
Noise	dropout	random subnetworks → implicit ensemble	no co-adapted features allowed
Noise	data augmentation	train on label-preserving transforms	the declared invariances
Noise	label smoothing	soften one-hot targets	capped confidence
Architectural	weight sharing (CNNs)	same filter everywhere — hard prior	translation invariance, true or not
Implicit	early stopping	bound distance from init (≈ L2)	solutions near the start preferred
Implicit	SGD noise	kicked out of sharp minima	flat-minima preference

Reading the last column: every row names the bias purchased. If a method seems to reduce variance for free, you have not found its bias yet — diagnosis of whether you even need the purchase is the subject of overfitting / underfitting.

Mental Model

Regularisation = any restraint on how freely the model chases the training sample; all of it trades bias for variance.
Penalties are priors: L2 believes weights are small, L1 believes most are exactly zero.
Noise methods make memorisation unprofitable; dropout is secretly an ensemble.
Architecture is the strongest regulariser — weight sharing is a hard prior you cannot turn off.
Some of the best regularisation is implicit (early stopping, SGD noise) and arrives unbilled.