Linear Algebra

Inverse of a Matrix

The map that undoes A — and why you should almost never build it

01 · First principlesUndoing a transformation

A matrix A moves every point of space somewhere. The inverse asks the most natural follow-up question: is there a second map that moves everything back? A⁻¹ is defined by exactly that — apply A, then A⁻¹, and every vector returns home:

A⁻¹A = AA⁻¹ = I

Geometrically, if A rotates 30° and stretches x by 2, then A⁻¹ shrinks x by 2 and rotates back 30°: same movie, reversed. The algebraic payoff is that Ax = b solves formally as x = A⁻¹b — "which input produced this output?" answered by running the film backwards.

02 · ExistenceYou can only undo what was not destroyed

When does the reverse map exist? Precisely when A destroys no information. If two different inputs land on the same output, no map can send that output back to both — undoing is ambiguous. So A is invertible exactly when it is one-to-one, which for square matrices is one fact wearing several costumes: trivial null space, full rank, det ≠ 0, no zero eigenvalue. (The full equivalence list lives in singular matrices; that note is this note's photographic negative.)

The 2×2 case shows the mechanism nakedly:

A = [a b; c d]  ⟹  A⁻¹ = (1 / det A) · [d −b; −c a]
the determinant sits in the denominator

The swap-and-negate part is benign; everything dangerous lives in the 1/det. As A approaches collapse (det → 0), the entries of A⁻¹ blow up toward infinity. Undoing a near-collapse requires a near-infinite stretch — the geometric seed of every numerical problem below.

UNIT SQUARE A SHEARED + STRETCHED A⁻¹ — THE SAME MOVIE, REVERSED

A⁻¹ exists because the parallelogram still has area: nothing was flattened, so every output remembers its input. Flatten it (det = 0) and the way back is gone.

03 · How it breaksx = A⁻¹b is a formula, not an algorithm

The naive reading of x = A⁻¹b — "compute A⁻¹, then multiply" — is how textbooks write and how production code should never run. Computing the explicit inverse means solving n linear systems (one per column of I), roughly three times the work of solving the one system you actually wanted; it then spends a dense matrix–vector multiply repeating work a triangular solve does cheaper. And it is numerically the worse path: the explicit inverse loses accuracy in the formation step and again in the multiply, while a factor-and-solve commits its rounding error once, in a backward-stable way.

GoalInverse habitWorking habit
Solve Ax = bx = inv(A) @ bfactor once (LU/Cholesky), then solve(A, b)
Many right-hand sidesform A⁻¹, reuse itreuse the factorisation, solve per b
The expression A⁻¹Binv(A) @ Bsolve(A, B) — same object, computed sanely
Rule of thumb: in formulas, A⁻¹ is notation meaning "the solution of a system involving A". Read it as solve, not as a matrix you instantiate.

04 · The real enemyConditioning: near-singular means error-amplifying

Existence of A⁻¹ is a binary question; usefulness is not. The condition number κ(A) = σ_max/σ_min (largest over smallest singular value — the most-stretched over the most-squashed direction) measures how close to collapse the map is, and it governs a hard inequality: a relative error δ in b can become a relative error up to κ(A)·δ in the solution x.

‖δx‖/‖x‖  ≤  κ(A) · ‖δb‖/‖b‖
κ ≈ 10ᵏ ⟹ expect to lose k digits

The geometry: A squashes some direction nearly flat, so undoing A stretches that direction enormously — and any noise in b that happens to lie along it gets stretched too. With κ = 10⁸ and float32 (about 7 digits), the answer can be pure noise while the code runs without a single warning. This is why "is A invertible?" is the wrong practical question; the right one is "what is κ(A)?", and the cures for a bad answer — regularisation, orthogonal factorisations — are the subject of singular matrices and orthogonality.

05 · Why ML caresInverses you use but never form

  1. Newton's method updates by −H⁻¹∇L (see the Hessian): implemented as a linear solve or approximated (L-BFGS, conjugate gradient), never as an explicit inverse of a million-parameter matrix.
  2. Gaussian everything. The Gaussian log-density contains Σ⁻¹; GP regression contains (K + σ²I)⁻¹y. Both are computed as Cholesky solves (see PSD matrices), and the +σ²I exists partly to keep κ sane.
  3. Least squares is written (XᵀX)⁻¹Xᵀy and computed by QR — the textbook formula squares the condition number, the algorithm refuses to.
The named connection: in ML code, A⁻¹ is a unit of meaning (undo this map, whiten by this covariance, rescale by this curvature) and a linear solve is the unit of computation. Keeping the two separate is half of numerical literacy.
Mental Model