The baseline that must be beaten
We want to predict a number y from features x. The simplest committed guess is that y responds linearly: y = wᵀx + ε, where w is a vector of weights and ε is noise.
Everything about the model is in that one line: each feature contributes independently, proportionally, and additively. That assumption is strong (a high-bias bet — see bias–variance), which is exactly why the model is stable, interpretable, and hard to beat on small clean data.
The question that defines the method: which w? We need a notion of "best fit".
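Least squares is not an arbitrary choice. Assume the noise ε is Gaussian, ε ~ N(0, σ²), independent across the n observations, and maximise the likelihood of the data. The log-likelihood is log p(y | X, w) = −(n/2) log(2πσ²) − (1/(2σ²)) Σᵢ (yᵢ − wᵀxᵢ)², and only the last term depends on w.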
Maximising the likelihood is equivalent to minimising the sum of squared residuals. The loss follows from the noise model, not the other way round (Laplace noise, which has heavier tails, would give absolute error instead; see loss functions).
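The equivalence can be checked numerically. The sketch below is illustrative only: the synthetic data, the seed, and the helper names `neg_log_likelihood` and `sum_squared_residuals` are assumptions, not part of the original text. It compares the two objectives at the least-squares solution and at a perturbed weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative): y = X @ w_true + Gaussian noise.
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
sigma = 0.3
y = X @ w_true + rng.normal(scale=sigma, size=n)

def neg_log_likelihood(w, X, y, sigma):
    """Gaussian negative log-likelihood of y given X and weights w."""
    r = y - X @ w
    return 0.5 * len(y) * np.log(2 * np.pi * sigma**2) + (r @ r) / (2 * sigma**2)

def sum_squared_residuals(w, X, y):
    """Sum of squared residuals for weights w."""
    r = y - X @ w
    return r @ r

# Least-squares solution.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Any other weight vector: here, a small random perturbation of w_ls.
w_other = w_ls + rng.normal(scale=0.1, size=d)

# The objectives differ only by a positive scale 1/(2 sigma^2) and a
# constant that does not depend on w, so their gaps are proportional.
nll_gap = neg_log_likelihood(w_other, X, y, sigma) - neg_log_likelihood(w_ls, X, y, sigma)
ssr_gap = sum_squared_residuals(w_other, X, y) - sum_squared_residuals(w_ls, X, y)
```

Because the gap in negative log-likelihood is the gap in squared error divided by 2σ², whichever w minimises one also minimises the other, whatever the value of σ.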
Setting the gradient of the squared loss to zero gives the normal equations XᵀXw = Xᵀy and, when XᵀX is invertible, a closed form: w = (XᵀX)⁻¹Xᵀy.
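A minimal sketch of the closed form in NumPy follows. The function name `fit_normal_equations` and the synthetic data are assumptions made for illustration; the constant column is one common way to carry an intercept inside w.

```python
import numpy as np

def fit_normal_equations(X, y):
    """Solve X^T X w = X^T y for w (assumes X^T X is invertible)."""
    # Solving the linear system avoids forming an explicit inverse.
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(1)
features = rng.normal(size=(100, 2))
X = np.column_stack([np.ones(len(features)), features])  # constant column acts as intercept
w_true = np.array([0.5, 2.0, -1.0])
y = X @ w_true + rng.normal(scale=0.1, size=len(X))

w_hat = fit_normal_equations(X, y)

# Reference solution from NumPy's SVD-based least-squares routine.
w_ref, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gradient of the squared loss at w_hat; it should vanish up to round-off.
gradient = 2 * X.T @ (X @ w_hat - y)
```

In practice `np.linalg.lstsq` is usually the safer default, since it does not require XᵀX to be well conditioned.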
The geometry is the real content: ŷ = Xw can only live in the column space of X, so the best ŷ is the orthogonal projection of y onto that subspace (see image space). The residual vector is perpendicular to every feature; the model has extracted all linear signal, and what remains is, linearly speaking, unexplainable.
Least squares minimises the summed squared vertical distances; the fit is the projection of y onto the model's reachable set.
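The projection picture can be verified directly. In this sketch (random data and variable names are illustrative assumptions) the fitted values are compared with an explicit orthogonal projection onto the column space of X, and the residual is tested against every feature column.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)  # arbitrary target; the geometry holds regardless

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w_hat
residual = y - y_hat

# Orthonormal basis Q for the column space of X (reduced QR),
# then the orthogonal projection of y onto that subspace.
Q, _ = np.linalg.qr(X)
projection = Q @ (Q.T @ y)

# Dot product of the residual with each feature column.
feature_dots = X.T @ residual
```

Since the residual is perpendicular to ŷ, the squared lengths also satisfy Pythagoras: ‖y‖² = ‖ŷ‖² + ‖residual‖².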
The closed form breaks precisely when features are redundant. If two columns of X are nearly collinear, XᵀX is nearly singular (see singular matrices): the data cannot distinguish their contributions, so coefficients become huge, opposite-signed, and wildly sensitive to noise. Low bias, exploding variance.
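A small experiment makes the instability concrete. Everything here is an illustrative assumption: two features that differ by a perturbation of size 1e-4, true weights (1, 1), and two independent noise draws on the same design matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)  # almost an exact copy of x1
X = np.column_stack([x1, x2])
signal = x1 + x2                      # true weights are (1, 1)

# Condition number of X^T X: large values mean "nearly singular".
cond = np.linalg.cond(X.T @ X)

def fit(y):
    """Ordinary least-squares fit on the fixed design matrix X."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Refit on two independent noise draws with the same underlying signal.
w_a = fit(signal + 0.1 * rng.normal(size=n))
w_b = fit(signal + 0.1 * rng.normal(size=n))

print("condition number:", cond)
print("fit A:", w_a, "sum:", w_a.sum())
print("fit B:", w_b, "sum:", w_b.sum())
```

The individual coefficients typically swing far from (1, 1) and differ between the two fits, while their sum stays close to 2: the data pin down the combined effect of the two features but not how it is split between them.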
The fix is the standard one: add λI before solving, so that w = (XᵀX + λI)⁻¹Xᵀy for some λ > 0.
Ridge regression buys back stability by shrinking coefficients toward zero; equivalently it is MAP estimation with a zero-mean Gaussian prior on w (see MLE vs MAP). λ is a bias–variance dial: larger λ means more bias and less variance. Lasso (L1) goes further and can set some coefficients exactly to zero.
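The effect of the λI term can be sketched on a nearly collinear design. The function name `fit_ridge`, the data, and the choice λ = 1 are assumptions for illustration; in practice λ would be tuned, for example by cross-validation.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for w."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(4)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)  # nearly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)  # true weights are (1, 1)

w_plain = fit_ridge(X, y, lam=0.0)  # no regularisation: the normal equations
w_ridge = fit_ridge(X, y, lam=1.0)

# Adding lam * I lifts every eigenvalue of X^T X by lam,
# which bounds the condition number.
cond_plain = np.linalg.cond(X.T @ X)
cond_ridge = np.linalg.cond(X.T @ X + 1.0 * np.eye(2))

print("plain:", w_plain, "cond:", cond_plain)
print("ridge:", w_ridge, "cond:", cond_ridge)
```

The ridge solution splits the shared signal roughly evenly between the two features and shrinks it slightly toward zero, which is the bias paid for the drop in variance.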
The fitted line says little; the residuals say everything. Plot them against ŷ: a structureless cloud means the linear story is adequate. A curve means missing nonlinearity (add features or change model), a funnel means non-constant variance (the Gaussian-noise assumption is off), and isolated extreme residuals mean outliers are steering the fit (squared loss amplifies them quadratically).
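The residual check can be sketched without a plotting library. The helper `curvature_score` below is an ad hoc numeric stand-in for eyeballing a curve in the residual plot, not a standard diagnostic; it, the data, and the other names are assumptions for illustration.

```python
import numpy as np

def residuals_vs_fitted(X, y):
    """Fit by least squares; return (fitted values, residuals)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ w
    return y_hat, y - y_hat

def curvature_score(y_hat, resid):
    """Correlation between residuals and squared, centred fitted values.

    Near 0 for a structureless cloud; far from 0 when the residuals
    bend as a function of the fitted values.
    """
    z = (y_hat - y_hat.mean()) ** 2
    return np.corrcoef(z, resid)[0, 1]

rng = np.random.default_rng(5)
x = rng.uniform(-2, 2, size=300)
X = np.column_stack([np.ones_like(x), x])  # intercept plus one feature

# Case 1: the linear story is adequate.
y_lin = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=x.size)
# Case 2: a quadratic term the linear model cannot express.
y_curve = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(scale=0.3, size=x.size)

score_lin = curvature_score(*residuals_vs_fitted(X, y_lin))
score_curve = curvature_score(*residuals_vs_fitted(X, y_curve))

print("linear data:", score_lin)
print("curved data:", score_curve)
```

A scatter plot of the residuals against ŷ remains the primary tool; a score like this only summarises one of the patterns (the curve) and says nothing about funnels or outliers.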