General ML

Linear Regression

The baseline that must be beaten

01 · First principles: The simplest hypothesis

We want to predict a number y from features x. The simplest committed guess is that y responds linearly:

y = w^Tx + b + ε

Everything about the model is in that one line: each feature contributes independently, proportionally, and additively. That assumption is strong (a high-bias bet — see bias–variance), which is exactly why the model is stable, interpretable, and hard to beat on small clean data.
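A minimal sketch of what the model line claims, using NumPy. The weights, intercept, noise scale and test points below are made-up values for illustration only. It generates data from y = w^Tx + b + ε and checks the additivity claim: moving one feature by a fixed amount shifts the noise-free prediction by the same amount no matter what the other features are.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary fixed seed

n, d = 200, 3
w_true = np.array([2.0, -1.0, 0.5])  # hypothetical weights
b_true = 4.0                         # hypothetical intercept
sigma = 0.3                          # hypothetical noise scale

# Data drawn from the assumed model y = w^T x + b + eps
X = rng.normal(size=(n, d))
eps = rng.normal(scale=sigma, size=n)
y = X @ w_true + b_true + eps


def predict(x):
    """Noise-free part of the model: w^T x + b."""
    return x @ w_true + b_true


# Two very different points, both nudged by +1 in feature 0
x_a = np.array([0.0, 1.0, -2.0])
x_b = np.array([0.0, -3.0, 5.0])
delta = np.array([1.0, 0.0, 0.0])

effect_a = predict(x_a + delta) - predict(x_a)
effect_b = predict(x_b + delta) - predict(x_b)
# Both effects equal w_true[0]: the contribution of feature 0 does not
# depend on the values of the other features.
```

The same check would fail for any model with interactions or nonlinear terms, which is what "independently, proportionally, and additively" rules out.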

The question that defines the method: which w? We need a notion of "best fit".

02 · Why least squares: Squared error is Gaussian MLE

Least squares is not an arbitrary choice. Absorb the intercept b into w by appending a constant 1 to every x_i, assume the noise ε is Gaussian, ε ~ N(0, σ²), independent across examples, and maximise the likelihood of the data:

log p(y|X, w) = Σ_i log N(y_i | w^Tx_i, σ²) = const − (1/(2σ²)) Σ_i (y_i − w^Tx_i)²

Maximising the likelihood is therefore the same as minimising the sum of squared residuals. The loss follows from the noise model, not the other way round (heavier-tailed Laplace noise, for example, would give absolute error instead; see loss functions).
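The equivalence can be checked numerically. The sketch below uses synthetic data with made-up parameters and the intercept folded into w through a column of ones. It evaluates the Gaussian log-likelihood and the sum of squared residuals at two different weight vectors and confirms that they differ only by a constant and a negative scale factor, so they rank every w in opposite order.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary fixed seed
n = 100

# Last column of ones absorbs the intercept into w
X = np.column_stack([rng.normal(size=n), np.ones(n)])
w_true = np.array([1.5, -0.5])  # hypothetical parameters
sigma = 0.4                     # hypothetical noise scale
y = X @ w_true + rng.normal(scale=sigma, size=n)


def sse(w):
    """Sum of squared residuals."""
    r = y - X @ w
    return r @ r


def gaussian_log_lik(w, sigma):
    """log p(y | X, w) under i.i.d. N(0, sigma^2) noise."""
    r = y - X @ w
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - (r @ r) / (2 * sigma**2)


# Least-squares solution and an arbitrary perturbed alternative
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
w_other = w_ls + np.array([0.3, -0.2])

# log-likelihood + SSE / (2 sigma^2) is the same constant for every w
const_at_ls = gaussian_log_lik(w_ls, sigma) + sse(w_ls) / (2 * sigma**2)
const_at_other = gaussian_log_lik(w_other, sigma) + sse(w_other) / (2 * sigma**2)
```

Note that σ only rescales the objective, so the maximising w does not depend on it.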

03 · The solution: The normal equation is a projection

Setting the gradient of the squared loss to zero gives a closed form:

w* = (X^TX)⁻¹ X^Ty
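For reference, the gradient step can be written out as follows, assuming the intercept has been absorbed into w through a constant column of X and that X^TX is invertible:

```latex
\begin{aligned}
L(w) &= \lVert y - Xw \rVert^2 = (y - Xw)^\top (y - Xw) \\
\nabla_w L(w) &= -2\, X^\top (y - Xw) \\
\nabla_w L(w^*) = 0 \;&\Longrightarrow\; X^\top X\, w^* = X^\top y \\
&\Longrightarrow\; w^* = (X^\top X)^{-1} X^\top y
\end{aligned}
```

The middle line, X^T(y − Xw*) = 0, already says that the residual is orthogonal to every column of X.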

The geometry is the real content: ŷ = Xw can only live in the column space of X, so the best ŷ is the orthogonal projection of y onto that subspace (see image space). The residual vector is perpendicular to every feature; the model has extracted all linear signal, and what remains is, linearly speaking, unexplainable.

Least squares minimises the summed squared vertical distances; the fit is the projection of y onto the model's reachable set.
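The projection picture can be verified directly. This is a small sketch on random data (shapes and seed are arbitrary): it solves the normal equations, then checks that the residual is orthogonal to every column of X and that the matrix P = X(X^TX)⁻¹X^T behaves like an orthogonal projector.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary fixed seed
n, d = 50, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)  # any target works; orthogonality does not depend on y

# Normal equations X^T X w = X^T y, solved as a linear system
w_star = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ w_star
residual = y - y_hat

# Inner product of the residual with each feature column (should be ~0)
orth = X.T @ residual

# Projection ("hat") matrix onto the column space of X
P = X @ np.linalg.solve(X.T @ X, X.T)

# Squared lengths, for the Pythagorean check |y|^2 = |y_hat|^2 + |residual|^2
len_y = y @ y
len_parts = y_hat @ y_hat + residual @ residual
```

P is formed explicitly here only to make the geometry visible; for large n it is an n × n matrix and would not normally be built.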

In practice we never invert. (X^TX)⁻¹ is written for theory; software solves the linear system (QR or Cholesky) instead — cheaper and numerically safer (see matrix inverse).
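As an illustration of "solve, don't invert", the sketch below computes the same coefficients four ways with NumPy on well-conditioned synthetic data (the true weights are made up). On data like this all four agree; the differences in cost and accuracy only show up as the problem grows or becomes ill-conditioned.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary fixed seed
n, d = 200, 4
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=n)

# 1. Textbook formula with an explicit inverse (shown only for comparison)
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# 2. QR: X = QR with orthonormal Q, then solve the triangular system R w = Q^T y
Q, R = np.linalg.qr(X)  # reduced factorisation by default
w_qr = np.linalg.solve(R, Q.T @ y)

# 3. Cholesky on the normal equations: X^T X = L L^T, then two triangular solves
L = np.linalg.cholesky(X.T @ X)
z = np.linalg.solve(L, X.T @ y)
w_chol = np.linalg.solve(L.T, z)

# 4. Library least-squares routine (SVD-based in NumPy)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

`np.linalg.solve` is a general solver and does not exploit the triangular structure of R or L; a dedicated triangular solver such as `scipy.linalg.solve_triangular` would be the more economical choice if SciPy is available.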

04 · Failure and fix: When X^TX is nearly singular

The closed form breaks precisely when features are redundant. If two columns of X are nearly collinear, X^TX is nearly singular (see singular matrices): the data cannot distinguish their contributions, so coefficients become huge, opposite-signed, and wildly sensitive to noise — low bias, exploding variance.
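A small simulation makes the instability concrete. Everything here is synthetic and the numbers (perturbation size 1e-4, noise scale 0.1, 20 repetitions) are arbitrary choices: two almost identical features, the same underlying signal, and fresh noise on each refit. The individual coefficients swing wildly between refits, while their sum, which is the only thing the data can actually pin down, stays stable.

```python
import numpy as np

rng = np.random.default_rng(4)  # arbitrary fixed seed
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)  # nearly a copy of x1
X = np.column_stack([x1, x2])
signal = x1 + x2                     # true coefficients are (1, 1)

# Condition number of X^T X: large values mean "nearly singular"
cond = np.linalg.cond(X.T @ X)

# Refit on 20 independent noise draws
fits = []
for _ in range(20):
    noise = rng.normal(scale=0.1, size=n)
    w, *_ = np.linalg.lstsq(X, signal + noise, rcond=None)
    fits.append(w)
fits = np.array(fits)

coef_spread = fits.std(axis=0)      # spread of each coefficient across refits
sum_spread = fits.sum(axis=1).std() # spread of w1 + w2 across refits
```

Comparing `coef_spread` with `sum_spread` shows the variance living almost entirely in the direction where the two coefficients trade off against each other.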

The fix is the standard one: add λI before solving.

w_ridge = (X^TX + λI)⁻¹ X^Ty

Ridge regression buys back stability by shrinking coefficients toward zero; equivalently it is MAP estimation with a zero-mean Gaussian prior on w (see MLE vs MAP). λ is a bias–variance dial: larger λ means more bias and less variance. Lasso (L1) goes further and can set some coefficients exactly to zero.
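The ridge formula is a one-line change to the solve. The sketch below applies it to two nearly collinear synthetic features; the helper name `ridge`, the value λ = 1 and the data are illustrative assumptions, and the intercept is omitted for brevity (in practice it is usually left unpenalised). Under the MAP reading with prior w ~ N(0, τ²I), λ corresponds to σ²/τ².

```python
import numpy as np

rng = np.random.default_rng(6)  # arbitrary fixed seed
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)  # nearly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.1, size=n)  # true coefficients (1, 1)


def ridge(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y without forming an inverse."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


lam = 1.0  # hypothetical value; normally chosen by cross-validation
w_ols = ridge(X, y, 0.0)    # lam = 0 recovers ordinary least squares
w_ridge = ridge(X, y, lam)

# Adding lam * I lifts the smallest eigenvalue, so conditioning improves
cond_ols = np.linalg.cond(X.T @ X)
cond_ridge = np.linalg.cond(X.T @ X + lam * np.eye(2))
```

With collinear features ridge tends to split the shared effect evenly between them, which is why `w_ridge` lands near (1, 1) while `w_ols` can be far from it.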

05 · Diagnosis: Read the residuals

The fitted line says little; the residuals say everything. Plot them against ŷ: a structureless cloud means the linear story is adequate. A curve means missing nonlinearity (add features or change model), a funnel means non-constant variance (the constant-variance part of the Gaussian-noise assumption is off), and isolated extreme residuals mean outliers are steering the fit (squared loss amplifies them quadratically).
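The patterns described above can be reproduced on synthetic data. The sketch below fits a straight line to three made-up datasets (adequate, curved, funnel-shaped) and summarises each residual plot with two crude numbers. The helper names and both scores are ad hoc illustrations, not standard statistical tests, and they are no substitute for actually looking at the plot.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary fixed seed
n = 500
x = rng.uniform(0.0, 3.0, size=n)
X = np.column_stack([x, np.ones(n)])  # column of ones gives the intercept


def residuals_vs_fitted(y):
    """Fit a line by least squares; return fitted values and residuals."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ w
    return y_hat, y - y_hat


def curvature_score(y_hat, r):
    """Correlation of residuals with centred, squared fitted values.
    Far from 0 suggests a U-shaped (curved) residual plot."""
    return np.corrcoef((y_hat - y_hat.mean()) ** 2, r)[0, 1]


def funnel_score(y_hat, r):
    """Correlation of |residual| with fitted values.
    Far from 0 suggests the residual spread changes with y_hat."""
    return np.corrcoef(y_hat, np.abs(r))[0, 1]


noise = rng.normal(scale=0.2, size=n)
y_ok = 2 * x + 1 + noise                   # linear model is adequate
y_curved = 2 * x + 1 + 0.8 * x**2 + noise  # missing quadratic term
y_funnel = 2 * x + 1 + noise * (0.2 + x)   # noise scale grows with x

ok = residuals_vs_fitted(y_ok)
curved = residuals_vs_fitted(y_curved)
funnel = residuals_vs_fitted(y_funnel)
```

For a plot, a scatter of each `(y_hat, residual)` pair is enough; the scores only put a number on what the eye picks up immediately.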

Mental Model

Linear regression = orthogonal projection of y onto the span of the features; least squares is what Gaussian noise + MLE forces.
The normal equation is theory; solving the system is practice.
Collinear features → near-singular X^TX → exploding coefficient variance; ridge's λI is the antidote (and a Gaussian prior in disguise).
Trust the residual plot, not the R². Linear regression is also the baseline: anything fancier must beat it to justify its extra variance.
