01 · First principles: One inequality, every direction
A symmetric matrix A defines a landscape: to every direction x it assigns the number xᵀAx (the quadratic form — for unit x, "how much A-ness lies along x", a second-order cousin of the dot product). Positive semi-definiteness is a single promise about that landscape:
xᵀAx ≥ 0 for every x
no direction is downhill; the landscape never dips below zero
Strictly positive for all x ≠ 0 makes A positive definite (PD); allowing equality makes it semi-definite (PSD), where some directions are exactly flat. The definition sounds abstract until you notice that two of the most common matrices in ML satisfy it automatically, and that the property is exactly what makes "variance", "energy", and "convex bowl" meaningful words.
02 · The picture: Bowl versus saddle
Plot f(x) = ½xᵀAx. If A is PD, every direction curves upward: a bowl, with one bottom. If A has a negative eigenvalue, some direction curves down; when other eigenvalues are positive the surface is a saddle (uphill one way, downhill another). Either way f is unbounded below, so "the minimum" stops existing. PSD is the boundary case: a bowl with some perfectly flat valley directions (the null space of A).
Level sets of ½xᵀAx. PD gives nested ellipses around a true minimum; one negative eigenvalue bends the contours into hyperbolas and opens an escape route downhill.
The eigenvalue translation is immediate. By the spectral theorem a symmetric A factors as A = QΛQᵀ with Q orthogonal (rotate, stretch, rotate back). In the eigenbasis coordinates y = Qᵀx the form becomes xᵀAx = Σ λᵢ yᵢ², a weighted sum of squares. It is nonnegative for all inputs precisely when every weight is nonnegative: PSD ⇔ all λᵢ ≥ 0. PSD matrices are the matrices with no negative stretch anywhere.
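The spectral reading can be checked numerically. The sketch below assumes NumPy; the helper name is_psd, its tolerance, and the example matrices are illustrative choices, not part of the text above.

```python
import numpy as np

def is_psd(A, tol=1e-10):
    """Return True if A is symmetric and has no eigenvalue below -tol."""
    A = np.asarray(A, dtype=float)
    if not np.allclose(A, A.T):
        return False  # the PSD definition used here assumes symmetry
    # eigvalsh is meant for symmetric matrices; eigenvalues come back ascending
    return bool(np.linalg.eigvalsh(A)[0] >= -tol)

bowl = np.array([[2.0, 0.0], [0.0, 1.0]])     # eigenvalues 1, 2: PD
valley = np.array([[1.0, 1.0], [1.0, 1.0]])   # eigenvalues 0, 2: PSD with a flat direction
saddle = np.array([[1.0, 0.0], [0.0, -1.0]])  # eigenvalues -1, 1: indefinite

# Check x^T A x = sum_i lambda_i * y_i^2 with y = Q^T x
A = np.array([[2.0, 1.0], [1.0, 2.0]])
lam, Q = np.linalg.eigh(A)          # A = Q diag(lam) Q^T
x = np.array([0.3, -1.2])
y = Q.T @ x                         # coordinates of x in the eigenbasis
form_direct = x @ A @ x
form_spectral = np.sum(lam * y**2)  # agrees with form_direct up to round-off
```

The absolute tolerance is a judgment call: it lets round-off-sized negative eigenvalues pass, and would need rescaling for matrices with very large or very small entries.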
03 · The two families: Why covariance and Gram matrices are born PSD
Nearly every PSD matrix you will meet belongs to one of two families, and each is PSD for a one-line reason worth internalising.
Covariance Σ = E[(z−μ)(z−μ)ᵀ]
For any direction u, the form is a variance: uᵀΣu = E[(uᵀ(z−μ))²] = Var(uᵀz) ≥ 0. A negative value would mean negative variance along some projection — nonsense. PSD is the matrix-shaped statement "spread cannot be negative".
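A small numerical sketch of the identity uᵀΣu = Var(uᵀz), assuming NumPy and using the sample covariance in place of the expectation. The data, the mixing matrix, and the direction u are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# 500 samples of a 3-dimensional z with correlated coordinates
# (the mixing matrix is an arbitrary illustrative choice)
mix = np.array([[1.0, 0.0, 0.0],
                [0.5, 1.0, 0.0],
                [-0.3, 0.2, 0.1]])
Z = rng.standard_normal((500, 3)) @ mix.T

Sigma = np.cov(Z, rowvar=False)    # 3x3 sample covariance (divides by n - 1)
u = np.array([1.0, -2.0, 0.5])     # any direction

form = u @ Sigma @ u               # the quadratic form u^T Sigma u
var_proj = np.var(Z @ u, ddof=1)   # sample variance of the projected data
# form and var_proj agree up to round-off, and a variance cannot be negative
```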
Gram / AᵀA
For any x: xᵀ(AᵀA)x = (Ax)ᵀ(Ax) = ‖Ax‖² ≥ 0. Squared lengths cannot be negative. Kernel matrices, XᵀX in regression, and JᵀJ in Gauss–Newton are all this family; zero eigenvalues appear exactly where A has a null space.
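The same one-line argument can be watched in numbers. This sketch assumes NumPy and picks a wide random A so that the null space, and hence the zero eigenvalues of AᵀA, is visible; the shapes and thresholds are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 5))    # 3x5: rank at most 3, so a null space of dimension >= 2
G = A.T @ A                        # 5x5 Gram matrix

x = rng.standard_normal(5)
form = x @ G @ x                       # x^T (A^T A) x
sq_len = np.linalg.norm(A @ x) ** 2    # ||Ax||^2, equal to form up to round-off

eigs = np.linalg.eigvalsh(G)           # ascending; none meaningfully below zero
n_zero = int(np.sum(np.abs(eigs) < 1e-10))   # numerically zero eigenvalues
```

For a generic 3×5 matrix the rank is 3, so n_zero matches the 2-dimensional null space of A.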
In fact the second family is exhaustive: every PSD matrix can be written as BᵀB for some B. PSD matrices are precisely the matrices that are secretly a "squared" map — which is why they behave like nonnegative numbers among matrices, and why the next section's square root exists.
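One way to exhibit such a B, sketched here under the assumption that NumPy is available, is through the eigendecomposition: B = Λ^{1/2}Qᵀ. The function name psd_factor is hypothetical, and this is only one of many valid choices of B.

```python
import numpy as np

def psd_factor(A):
    """Return B with B.T @ B equal to A (up to round-off) for symmetric PSD A."""
    lam, Q = np.linalg.eigh(A)           # A = Q diag(lam) Q^T
    lam = np.clip(lam, 0.0, None)        # clip round-off negatives to zero
    return np.sqrt(lam)[:, None] * Q.T   # B = diag(sqrt(lam)) Q^T

# PSD but singular: eigenvalues 0 and 2
A = np.array([[1.0, 1.0], [1.0, 1.0]])
B = psd_factor(A)
reconstructed = B.T @ B                  # Q diag(lam) Q^T, i.e. A again
```

Unlike Cholesky, this construction does not need A to be strictly PD, at the price of a full eigendecomposition and a B that is not triangular.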
04 · The working tool: Cholesky, the square root you actually use
For a PD matrix there is a canonical, cheap factorisation — the Cholesky decomposition:
A = L Lᵀ, L lower-triangular, positive diagonal
half the cost of LU; exists iff A is PD (the practical PD test)
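The "practical PD test" can be sketched as follows, assuming NumPy. The wrapper name is_pd is hypothetical; the explicit symmetry check is there because np.linalg.cholesky reads only one triangle of its input and does not verify symmetry itself.

```python
import numpy as np

def is_pd(A):
    """Practical PD test: A is symmetric and its Cholesky factorisation succeeds."""
    A = np.asarray(A, dtype=float)
    if A.ndim != 2 or A.shape[0] != A.shape[1] or not np.allclose(A, A.T):
        return False
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:   # raised when the factorisation breaks down
        return False

A = np.array([[4.0, 2.0], [2.0, 3.0]])
L = np.linalg.cholesky(A)   # lower-triangular: [[2, 0], [1, sqrt(2)]]
```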
L behaves as the working square root of A, and two everyday operations run on it:
Sampling correlated Gaussians. If z ~ N(0, I) then x = μ + Lz has mean μ and covariance E[Lz(Lz)ᵀ] = L E[zzᵀ] Lᵀ = LLᵀ = Σ. The triangular L is the map that "installs" the desired correlations into white noise; multivariate-Gaussian samplers (VAE reparameterisation with full covariance, GP posterior draws) typically come down to this line.
Solving SPD systems. Σ⁻¹b in a Gaussian log-density or (K + σ²I)⁻¹y in GP regression is two triangular solves through L (forward with L, then backward with Lᵀ), never an explicit inverse, and log det Σ = 2 Σᵢ log Lᵢᵢ falls out of the diagonal for free (see determinant).
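Both operations can be sketched on one factor L. The example assumes NumPy; μ, Σ, and b are made-up values, and the hand-written forward_sub and back_sub helpers are illustrative stand-ins for a library triangular solver.

```python
import numpy as np

def forward_sub(L, b):
    """Solve L y = b for lower-triangular L."""
    y = np.zeros(len(b))
    for i in range(len(b)):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def back_sub(U, y):
    """Solve U x = y for upper-triangular U."""
    x = np.zeros(len(y))
    for i in reversed(range(len(y))):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
L = np.linalg.cholesky(Sigma)        # Sigma = L @ L.T

# Sampling: each row of samples is mu + L z with z ~ N(0, I)
rng = np.random.default_rng(0)
z = rng.standard_normal((200_000, 2))
samples = mu + z @ L.T
emp_cov = np.cov(samples, rowvar=False)   # close to Sigma for this many draws

# Solving: Sigma^{-1} b as a forward solve with L, then a backward solve with L^T
b = np.array([0.5, 2.0])
sol = back_sub(L.T, forward_sub(L, b))

# Log-determinant read off the diagonal of L
logdet = 2.0 * np.sum(np.log(np.diag(L)))
```

In production code a library routine that exploits the triangular structure would normally replace the two loops; they are spelled out here only to show that no inverse is ever formed.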
When a "covariance" fails Cholesky because round-off produced a tiny negative eigenvalue, the standard fix is the λI lift from singular matrices: add a small jitter εI to the diagonal. This raises every eigenvalue by ε and restores the floor.
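A minimal sketch of the jitter fix, assuming NumPy. The function name cholesky_with_jitter, the starting jitter, and the tenfold growth schedule are illustrative choices, not a standard API.

```python
import numpy as np

def cholesky_with_jitter(A, start=1e-10, max_tries=10):
    """Try Cholesky; on failure retry with A + jitter * I, growing jitter tenfold.

    Returns (L, jitter_used).
    """
    A = 0.5 * (A + A.T)                  # symmetrise away round-off asymmetry
    eye = np.eye(A.shape[0])
    jitter = 0.0
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(A + jitter * eye), jitter
        except np.linalg.LinAlgError:
            jitter = start if jitter == 0.0 else 10.0 * jitter
    raise np.linalg.LinAlgError("matrix is not PD even after adding jitter")

# A "covariance" whose flat direction (1, -1) has been pushed to about -3e-9
A = np.array([[1.0 - 1.5e-9, 1.0 + 1.5e-9],
              [1.0 + 1.5e-9, 1.0 - 1.5e-9]])
L, used = cholesky_with_jitter(A)
```

The returned factor reproduces A + used·I rather than A itself, so the jitter should stay small relative to the scale of the entries.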
The named connection: PSD is the precondition for optimisation's nicest regime — a PSD Hessian is what "locally convex" means, Gauss–Newton replaces the true Hessian with the born-PSD JᵀJ to guarantee downhill steps, and Adam's denominator is a diagonal PSD estimate. Convexity, variance, and kernels are one property wearing three coats.
Mental Model
PSD = the quadratic form never dips below zero: a bowl (possibly with flat valley directions), never a saddle.
Spectral reading: symmetric with all λ ≥ 0; the form is Σλᵢyᵢ², a sum of nonnegatively-weighted squares.
Covariances are PSD because variance cannot be negative; Gram matrices because squared lengths cannot be.
Every PSD matrix is secretly BᵀB — a squared map; Cholesky L is the working square root.
Sampling N(μ, Σ) is μ + Lz; solving and log-dets ride on the same L. PSD is where the safe, fast algorithms live.