Linear Algebra

Orthogonality

Zero overlap, no shared information

01 · First principles: What does a·b = 0 actually buy?

The dot product measures agreement, and orthogonality is the special case of none: a·b = 0 means the shadow of b along a has zero length. Neither vector carries any information about the other; whatever you learn by measuring along a tells you exactly nothing about the component along b.

That sounds like a negative property — an absence of relationship. The surprise of this note is that the absence is the most valuable structure in computational linear algebra. When directions do not interfere, every question about a vector decomposes into independent one-dimensional questions, and problems that otherwise require solving systems collapse into reading off dot products.

Working slogan: orthogonal directions are separate channels. No crosstalk, so each can be handled alone.

02 · The payoff: Coordinates without solving anything

To express x in an arbitrary basis {b₁, …, bₙ} you must solve a linear system: the coefficients are entangled, because each basis vector leaks into the others' directions. Watch what happens with an orthonormal basis {q₁, …, qₙ} (mutually orthogonal, unit length). Write x = c₁q₁ + ⋯ + cₙqₙ and take the dot product of both sides with qᵢ:

x · qᵢ = c₁(q₁·qᵢ) + ⋯ + cᵢ(qᵢ·qᵢ) + ⋯ + cₙ(qₙ·qᵢ) = cᵢ
every cross term dies; only qᵢ·qᵢ = 1 survives

So cᵢ = x·qᵢ, full stop. Each coordinate is one dot product, computed independently of all the others; the n-dimensional problem fell apart into n one-dimensional projections. This identity — x = Σ (x·qᵢ) qᵢ — is the engine inside Fourier series, PCA coordinates, and every "project onto components" argument you have ever seen.

[Figure: x decomposed along q₁ and q₂ into its shadows (x·q₁) q₁ and (x·q₂) q₂. Each coordinate = one shadow; no system to solve.]

In an orthonormal frame, x is rebuilt from its two shadows independently. With a skewed basis, the shadows would double-count and a system would have to untangle them.
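The contrast between the two kinds of basis can be sketched numerically. This is a minimal illustration, assuming NumPy; the vectors `q1`, `q2`, `b1`, `b2` and `x` are made-up example data, not values from the note.

```python
import numpy as np

# Hypothetical example data: an orthonormal basis of R^2 (the standard basis rotated by 30 degrees).
theta = np.pi / 6
q1 = np.array([np.cos(theta), np.sin(theta)])
q2 = np.array([-np.sin(theta), np.cos(theta)])
x = np.array([2.0, 1.0])

# Orthonormal case: each coordinate is one dot product, computed independently.
c = np.array([x @ q1, x @ q2])
x_rebuilt = c[0] * q1 + c[1] * q2

# Skewed basis (also hypothetical): b1 and b2 are not orthogonal.
b1 = np.array([1.0, 0.0])
b2 = np.array([1.0, 1.0])
B = np.column_stack([b1, b2])

# The true coordinates require solving the linear system B @ coeffs = x.
coeffs = np.linalg.solve(B, x)

# Using raw dot products as coordinates double-counts the shared direction.
shadows = np.array([x @ b1, x @ b2])
naive = shadows[0] * b1 + shadows[1] * b2

print("orthonormal rebuild matches x:", np.allclose(x_rebuilt, x))
print("skewed coordinates from solve:", coeffs)
print("naive rebuild from dot products:", naive)
```

In the orthonormal frame the two dot products rebuild x exactly; in the skewed frame the same recipe overshoots, and only the linear solve recovers the right coefficients.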

03 · The matrices: Orthogonal matrices: rigid motions

Pack an orthonormal basis into the columns of Q. The orthonormality conditions qᵢ·qⱼ = δᵢⱼ are precisely the statement QᵀQ = I, which hands us the inverse for free:

QᵀQ = I  ⟹  Q⁻¹ = Qᵀ
the one family of matrices whose inverse costs nothing

Geometrically, Q is a rigid motion (a rotation or reflection). It preserves every length and every angle, because it preserves the dot product itself: (Qx)·(Qy) = xᵀQᵀQy = x·y. Space is moved, never distorted: the unit circle stays a unit circle. Contrast this with a general invertible matrix, which shears and stretches, and whose inverse must be earned by elimination. Numerically, multiplying by Q is perfectly conditioned (κ(Q) = 1 in the 2-norm): it amplifies neither the signal nor the rounding error, which is why stable algorithms are built almost entirely out of orthogonal transformations.
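These properties are easy to check numerically. The sketch below assumes NumPy and builds an orthogonal Q from the QR factorisation of a random square matrix; the size 4 and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# The Q factor of a random square matrix is orthogonal.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))

x = rng.standard_normal(4)
y = rng.standard_normal(4)

# Orthonormal columns: Q^T Q = I, so the transpose is the inverse.
gram_is_identity = np.allclose(Q.T @ Q, np.eye(4))
inverse_is_transpose = np.allclose(np.linalg.inv(Q), Q.T)

# Dot products, and therefore lengths and angles, are preserved.
dot_preserved = np.isclose((Q @ x) @ (Q @ y), x @ y)
length_preserved = np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x))

# 2-norm condition number; should equal 1 up to rounding.
cond_Q = np.linalg.cond(Q)

print(gram_is_identity, inverse_is_transpose, dot_preserved, length_preserved)
print("cond(Q) =", cond_Q)
```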

04 · The workhorse: Projection, QR, and least squares

When Ax = b has no solution (b lies outside the image of A), the best we can do is the x making Ax closest to b. "Closest" means the error b − Ax is orthogonal to the image: it is the perpendicular dropped from b to the reachable subspace. That orthogonality condition is the normal equations: Aᵀ(b − Ax) = 0, or equivalently AᵀAx = Aᵀb.

In practice one does not solve the normal equations directly (forming AᵀA squares the condition number). Instead, factor A = QR, where Q has orthonormal columns and R is upper triangular (Gram–Schmidt, made industrial). Substituting into the normal equations gives Rᵀ(Qᵀb − Rx) = 0, so when A has full column rank (R invertible) the triangular system Rx = Qᵀb finishes the job stably, and the projection of b onto the image is QQᵀb. Orthogonalise first, and the hard geometry becomes bookkeeping.
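A small numerical sketch of this route, assuming NumPy and a made-up, well-conditioned random system with full column rank. The general-purpose `np.linalg.solve` stands in for a dedicated back-substitution routine, and the normal-equations solve is included only for comparison.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# Hypothetical overdetermined system: 20 equations, 3 unknowns.
A = rng.standard_normal((20, 3))
b = rng.standard_normal(20)

# Reduced QR (NumPy's default mode): Q is 20x3 with orthonormal columns,
# R is 3x3 upper triangular.
Q, R = np.linalg.qr(A)

# Solve R x = Q^T b. A triangular solver would exploit R's structure;
# the general solver is used here only to keep the sketch NumPy-only.
x_qr = np.linalg.solve(R, Q.T @ b)

# The residual is orthogonal to the image of A (the normal equations hold).
residual = b - A @ x_qr
normal_eq_residual = A.T @ residual

# Q Q^T b is the projection of b onto the image, i.e. A @ x_qr.
projection = Q @ (Q.T @ b)

# For comparison: the normal equations give the same x on this easy problem,
# but the matrix they use has the squared condition number.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)
cond_A = np.linalg.cond(A)
cond_AtA = np.linalg.cond(A.T @ A)

print("A^T (b - A x) ~ 0:", np.allclose(normal_eq_residual, 0.0, atol=1e-10))
print("cond(A) =", cond_A, " cond(A^T A) =", cond_AtA)
```

On a well-conditioned problem like this one both routes agree. The difference shows up when A is nearly rank-deficient, where the squared condition number of AᵀA costs roughly twice as many digits of accuracy.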

05 · Why ML cares: Orthogonality as a design principle

  1. Orthogonal initialisation. Initialise a deep linear layer with an orthogonal matrix and signals pass through with norms intact (‖Qx‖ = ‖x‖); gradients neither explode nor vanish through that layer. The same instinct underlies spectral and orthogonality-regularised training of RNNs, where repeated multiplication makes any deviation from norm preservation compound exponentially (see eigenvalues and dynamics).
  2. Decorrelated features. Correlated inputs are a skewed basis: weight updates interfere, and the loss surface elongates (a conditioning problem, see the Hessian). Whitening — rotating to the orthogonal eigenbasis of the covariance and rescaling — turns the basis orthonormal and the bowl round.
  3. QR and least squares sit inside every classical regression, and orthogonal (Householder) transformations are the substrate of the QR algorithm that computes eigenvalues and SVDs — the factorisations PCA and LoRA-style analyses rest on.
The named connection: whitening and orthogonal init are the same idea at two scales — make directions non-interfering, so that learning along one does not corrupt another.
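Both ideas can be sketched in a few lines, assuming NumPy. The layer width, depth, mixing matrix `mix` and sample count are illustrative assumptions. The "network" is a stack of purely linear layers sharing one orthogonal matrix, and the whitening shown is the PCA variant (rotate to the covariance eigenbasis, then rescale).

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed

# --- Orthogonal initialisation (linear layers only, no nonlinearity) ---
n = 64
W, _ = np.linalg.qr(rng.standard_normal((n, n)))  # orthogonal n x n matrix
h = rng.standard_normal(n)
h_deep = h.copy()
for _ in range(50):  # 50 stacked linear layers sharing W, for simplicity
    h_deep = W @ h_deep
norm_ratio = np.linalg.norm(h_deep) / np.linalg.norm(h)  # stays at 1 up to rounding

# --- PCA whitening of correlated features ---
# Hypothetical mixing matrix that makes the two features strongly correlated.
mix = np.array([[2.0, 0.0],
                [1.5, 0.5]])
X = rng.standard_normal((5000, 2)) @ mix.T
X = X - X.mean(axis=0)

cov = X.T @ X / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)  # orthonormal eigenbasis of the symmetric covariance

# Rotate into the eigenbasis, then rescale each direction to unit variance.
X_white = (X @ eigvecs) / np.sqrt(eigvals)
cov_white = X_white.T @ X_white / (len(X) - 1)

print("norm ratio after 50 orthogonal layers:", norm_ratio)
print("cond(cov) before whitening:", np.linalg.cond(cov))
print("covariance after whitening:\n", cov_white)
```

After whitening, the sample covariance is the identity, so the associated quadratic bowl is round. With a non-orthogonal `W` of the same size, the norm ratio would typically grow or shrink geometrically with depth.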
Mental Model