
Linear Regression

The baseline that must be beaten

01 · First principles: The simplest hypothesis

We want to predict a number y from features x. The simplest committed guess is that y responds linearly:

$$y = w^\top x + b + \varepsilon$$

Everything about the model is in that one line: each feature contributes independently, proportionally, and additively. That assumption is strong (a high-bias bet — see bias–variance), which is exactly why the model is stable, interpretable, and hard to beat on small clean data.
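A small sketch can make "independently, proportionally, and additively" concrete. The weights, intercept, and the helper name `predict` below are hypothetical, chosen only for illustration: bumping one feature by a fixed amount moves the prediction by that feature's weight times the bump, whatever the other features happen to be.

```python
import numpy as np

def predict(X, w, b):
    """Linear model: one dot product per row of X, plus the intercept."""
    return X @ w + b

# Hypothetical weights and intercept, chosen only for illustration.
w = np.array([2.0, -1.0, 0.5])
b = 3.0

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # five arbitrary feature vectors

# Bump feature 0 by delta in every row, leaving the other features alone.
delta = 0.1
X_bumped = X.copy()
X_bumped[:, 0] += delta

# The prediction changes by w[0] * delta in every row,
# regardless of the values of the other two features.
change = predict(X_bumped, w, b) - predict(X, w, b)
print(change)
```

Every entry of `change` equals `w[0] * delta` up to floating-point rounding, which is the whole content of the linearity assumption.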

The question that defines the method: which w? We need a notion of "best fit".

02 · Why least squares: Squared error is Gaussian MLE

Least squares is not an arbitrary choice. Fold the intercept b into w by appending a constant feature 1 to every x, assume the noise ε is Gaussian, ε ~ N(0, σ²), and maximise the likelihood of the data:

$$\log p(y \mid X, w) = \sum_i \log \mathcal{N}(y_i \mid w^\top x_i, \sigma^2) = \text{const} - \frac{1}{2\sigma^2} \sum_i (y_i - w^\top x_i)^2$$

Maximising the likelihood over w is therefore the same as minimising the sum of squared residuals. The loss follows from the noise model, not the other way round (Laplace noise, which has heavier tails, would give absolute error instead; see loss functions).
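The equivalence can be checked numerically. The sketch below uses synthetic data and two arbitrary candidate weight vectors (all values are illustrative assumptions): the Gaussian negative log-likelihood equals a constant plus the sum of squared errors divided by 2σ², so both criteria rank any two candidates the same way.

```python
import numpy as np

def gaussian_nll(w, X, y, sigma):
    """Negative log-likelihood of y under y_i ~ N(w^T x_i, sigma^2)."""
    r = y - X @ w
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + (r @ r) / (2 * sigma**2)

def sse(w, X, y):
    """Sum of squared residuals."""
    r = y - X @ w
    return r @ r

# Synthetic data; the true weights and noise level are illustrative.
rng = np.random.default_rng(0)
n, d = 50, 3
sigma = 0.3
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + sigma * rng.normal(size=n)

# Two candidate weight vectors: a poor one and a good one.
w_a = np.zeros(d)
w_b = w_true.copy()

# The part of the NLL that does not depend on w.
const = 0.5 * n * np.log(2 * np.pi * sigma**2)

for w in (w_a, w_b):
    print(gaussian_nll(w, X, y, sigma), const + sse(w, X, y) / (2 * sigma**2))
```

The two printed numbers on each line agree, so whichever candidate has the smaller squared error also has the higher likelihood.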

03 · The solution: The normal equation is a projection

Setting the gradient of the squared loss to zero gives a closed form:

$$w^* = (X^\top X)^{-1} X^\top y$$

The geometry is the real content: ŷ = Xw can only live in the column space of X, so the best ŷ is the orthogonal projection of y onto that subspace (see image space). The residual vector is perpendicular to every feature; the model has extracted all linear signal, and what remains is, linearly speaking, unexplainable.
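The perpendicularity claim is easy to verify. This is a minimal sketch on synthetic data (the coefficients and noise level are assumptions): it solves the normal equations with `np.linalg.solve`, with a column of ones standing in for the intercept, and then checks that the residual is orthogonal to every column of X and to the fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
features = rng.normal(size=(n, 2))

# Append a column of ones so the intercept b becomes one more weight.
X = np.column_stack([features, np.ones(n)])

# Illustrative ground truth: weights (1.5, -0.7), intercept 2.0, small noise.
y = features @ np.array([1.5, -0.7]) + 2.0 + 0.1 * rng.normal(size=n)

# Normal equations: (X^T X) w = X^T y, solved as a linear system.
w_star = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ w_star      # projection of y onto the column space of X
residual = y - y_hat    # the part of y the model cannot reach

print(X.T @ residual)   # approximately zero: residual is orthogonal to each column
print(residual @ y_hat) # approximately zero: residual is orthogonal to the fit
```

Because one column of X is all ones, orthogonality to that column also means the residuals sum to zero.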

[Figure: y plotted against x, with the fitted line ŷ = wx + b and the residuals marked.]

Least squares minimises the summed squared vertical distances; the fit is the projection of y onto the model's reachable set.

In practice we never form the inverse. $(X^\top X)^{-1}$ is written for theory; software solves the linear system instead, typically via a QR factorisation of X or a Cholesky factorisation of $X^\top X$, which is cheaper and numerically safer (see matrix inverse).
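As a sketch of what "solving the system" looks like, the block below computes the same least-squares weights three ways with NumPy alone, on synthetic data with illustrative coefficients. The triangular systems are passed to the general-purpose `np.linalg.solve` for brevity; a dedicated triangular solver (for example in SciPy) would exploit their structure.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)

# QR route: X = Q R with orthonormal Q, so the normal equations reduce to R w = Q^T y.
Q, R = np.linalg.qr(X)                 # reduced factorisation: Q is 200x4, R is 4x4
w_qr = np.linalg.solve(R, Q.T @ y)

# Cholesky route: X^T X = L L^T, then two triangular solves.
L = np.linalg.cholesky(X.T @ X)        # lower-triangular factor
z = np.linalg.solve(L, X.T @ y)        # forward step: L z = X^T y
w_chol = np.linalg.solve(L.T, z)       # backward step: L^T w = z

# Library least-squares routine (SVD-based) for comparison.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_qr)
print(w_chol)
print(w_lstsq)
```

On well-conditioned data like this the three answers agree to many digits. The QR route works on X directly and so avoids squaring the condition number, which is why it is usually preferred when conditioning is in doubt.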

04 · Failure and fix: When $X^\top X$ is nearly singular

The closed form breaks precisely when features are redundant. If two columns of X are nearly collinear, $X^\top X$ is nearly singular (see singular matrices): the data cannot distinguish their contributions, so coefficients become huge, opposite-signed, and wildly sensitive to noise. Bias stays low while variance explodes.
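The instability is easy to reproduce. In this sketch (synthetic data; the noise scales and the helper name `fit` are assumptions) the second feature is a near copy of the first, and the model is refitted on 200 fresh noise draws. Individual coefficients swing wildly between refits, while their sum, which is all the data can actually pin down, barely moves.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)   # almost an exact copy of x1
X = np.column_stack([x1, x2])
signal = x1 + x2                      # illustrative true weights: (1, 1)

def fit(y):
    """Ordinary least squares via the library solver."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Refit on many independent noise realisations of the same signal.
fits = np.array([fit(signal + 0.1 * rng.normal(size=n)) for _ in range(200)])

cond = np.linalg.cond(X.T @ X)        # very large: X^T X is nearly singular
coef_std = fits.std(axis=0)           # spread of each coefficient across refits
sum_std = fits.sum(axis=1).std()      # spread of w1 + w2 across refits

print(cond)
print(coef_std)
print(sum_std)
```

The spread of each coefficient is orders of magnitude larger than the spread of their sum: the fit is confident about the combined effect and has almost no information about how to split it.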

The fix is the standard one: add λI before solving.

$$w_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$

Ridge regression buys back stability by shrinking coefficients toward zero; equivalently it is MAP estimation with a zero-mean Gaussian prior on w (see MLE vs MAP). λ is a bias–variance dial: larger λ means more bias and less variance. Lasso (L1) goes further and can set some coefficients exactly to zero.
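A minimal sketch of the ridge closed form on nearly collinear synthetic data follows. The helper name `ridge`, the data, and the grid of λ values are all illustrative assumptions, and for simplicity there is no intercept (in practice the intercept is usually left unpenalised). As λ grows, the coefficient norm shrinks and the system being solved becomes better conditioned.

```python
import numpy as np

def ridge(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y as a linear system."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Nearly collinear synthetic features, true weights (1, 1).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=n)

lams = [0.0, 1e-3, 1e-1, 10.0]        # lam = 0 is ordinary least squares
weights = [ridge(X, y, lam) for lam in lams]
norms = np.array([np.linalg.norm(w) for w in weights])
conds = np.array([np.linalg.cond(X.T @ X + lam * np.eye(2)) for lam in lams])

for lam, w, c in zip(lams, weights, conds):
    print(lam, w, c)
```

Even a tiny λ pulls the two coefficients back toward a sensible, nearly equal split, at the cost of a small downward bias that grows with λ.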

05 · Diagnosis: Read the residuals

The fitted line says little; the residuals say everything. Plot them against ŷ: a structureless cloud means the linear story is adequate. A curve means missing nonlinearity (add features or change model), a funnel means non-constant variance (the constant-σ² part of the noise assumption is off), and isolated extreme residuals mean outliers are steering the fit (squared loss amplifies them quadratically).
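A plot is the right tool, but the same patterns can be roughly summarised as numbers. The helper `residual_checks` below is a hypothetical illustration, not a standard diagnostic: it fits a line to one feature and returns two crude correlations, one that is large when the residuals curve against ŷ and one that is large when their spread grows with ŷ.

```python
import numpy as np

def residual_checks(x, y):
    """Fit a line, then return (curve, funnel) summary statistics.

    curve:  correlation of residuals with the squared, centred fitted values
    funnel: correlation of absolute residuals with the fitted values
    """
    slope, intercept = np.polyfit(x, y, deg=1)
    y_hat = slope * x + intercept
    resid = y - y_hat
    centred_sq = (y_hat - y_hat.mean()) ** 2
    curve = np.corrcoef(resid, centred_sq)[0, 1]
    funnel = np.corrcoef(np.abs(resid), y_hat)[0, 1]
    return curve, funnel

rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(0.0, 4.0, size=n)
noise = rng.normal(size=n)

# Three synthetic scenarios, each with illustrative parameters.
clean = residual_checks(x, 2.0 * x + 1.0 + 0.3 * noise)     # linear, constant variance
curved = residual_checks(x, x**2 + 0.1 * noise)             # missing nonlinearity
funnelled = residual_checks(x, 2.0 * x + 0.3 * x * noise)   # variance grows with x

print(clean)
print(curved)
print(funnelled)
```

Both statistics stay near zero for the clean case, the first is close to one for the curved case, and the second is clearly positive for the funnel. These summaries only catch the simple monotone patterns built into them, so they complement the residual plot rather than replace it.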

Mental Model