Linear Algebra

Singular Matrices

The map loses a dimension, and there is no way back

01 · First principlesWhat does "singular" mean physically?

A square matrix is singular when the transformation it performs loses a dimension: it takes n-dimensional space and flattens it into something thinner — a plane into a line, a volume into a sheet. Flattening is irreversible. Once two distinct inputs have been pressed onto the same output, no map can tell them apart again; the information is not hidden, it is gone.

That is the whole concept. Every algebraic test for singularity — determinant, rank, eigenvalues — is just a different instrument for detecting the same physical event: somewhere, a direction of space got crushed flat.

The image of the unit square as the matrix slides toward singularity: the parallelogram thins, then degenerates to a segment. A dimension has left the output.

02 · One fact, five costumesThe equivalence list

For a square n×n matrix A, the following are all equivalent — not five related facts but one fact, "A crushes a direction", reported by five different instruments.

Costume	Statement	What the instrument measures
det A = 0	volume scale factor is zero	the unit cube is flattened — determinant
rank < n	columns do not span ℝⁿ	outputs live in a thinner subspace — rank & span
null space ≠ {0}	some x ≠ 0 has Ax = 0	a whole direction is sent to the origin — null space
A⁻¹ does not exist	no map undoes A	two inputs share an output — inverse
λ = 0 is an eigenvalue	Av = 0·v for some v ≠ 0	the crushed direction, named — eigenvectors

The translation exercises are short and worth doing once. Zero eigenvalue ⇔ nontrivial null space: the eigenvector for λ = 0 is a null vector, by definition. det = 0 ⇔ rank-deficient: the determinant is the volume of the parallelepiped spanned by the columns, and dependent columns span a flat one. Each arrow is one sentence; the wisdom is in refusing to treat the five as separate things to memorise.

03 · How it breaks in practiceExactly singular is rare; nearly singular is everywhere

Here is the twist that matters for practice. With floating-point data, a matrix is almost never exactly singular — round any entry in the last bit and det ≠ 0 again. The practical enemy is the nearly singular matrix: full rank on paper, but with some direction squashed almost flat (smallest singular value σ_min ≈ 0). All five costumes then read "technically fine", while the condition number κ = σ_max/σ_min explodes and every solve amplifies noise by κ (the mechanics are in matrix inverse).

Where do near-singular matrices come from in ML? From redundancy. Two nearly-duplicate features make two columns of X nearly dependent, so XᵀX has a near-zero eigenvalue. More data dimensions than effective degrees of freedom, correlated parameters, nearly-collinear basis functions — every form of "the data does not really determine this direction" shows up as σ_min → 0.

Reading a warning sign: when a solver returns huge, wildly oscillating coefficients that fit the data perfectly, you are watching 1/σ_min at work — the model is dividing by a near-collapse.

04 · The fixWhy we add λI

The universal remedy is regularisation: replace the offending matrix by A + λI (for symmetric A, typically XᵀX or a Hessian). The effect is surgical. A + λI has the same eigenvectors as A, and every eigenvalue is lifted by exactly λ:

Av = μv ⟹ (A + λI)v = (μ + λ)v

every eigenvalue raised by λ; zero becomes λ; κ drops to (μ_max+λ)/(μ_min+λ)

The crushed directions — the ones the data says nothing about — get a floor of λ instead of zero, so the solve stops dividing by nothing there; the healthy directions barely notice. The price is a small bias: we are answering a slightly different question in exchange for the answer being stable. This one move, under different names:

Ridge regression / weight decay: (XᵀX + λI)⁻¹Xᵀy — collinear features get small, shared weights instead of exploding opposite ones.
Levenberg–Marquardt damping: Gauss–Newton with (JᵀJ + λI), so steps stay sane where the Jacobian is rank-deficient.
The ε in Adam, jitter in GP kernels: the same floor-under-the-denominator instinct, applied diagonally.

The named connection: ridge regression is the canonical ML answer to near-singularity — pay λ worth of bias to buy back the dimensions the data almost lost.

Mental Model

Singular = the map flattens space by at least one dimension; flattening destroys information, so there is no inverse.
det = 0, rank < n, nontrivial null space, no inverse, zero eigenvalue: one fact in five costumes.
Exactly singular is a measure-zero curiosity; nearly singular (σ_min ≈ 0, huge κ) is the everyday enemy.
Near-singularity in ML means redundancy: duplicated features, underdetermined directions.
Adding λI lifts every eigenvalue by λ — a small bias bought deliberately, to stop dividing by almost-zero.