Linear Algebra

Dot Product

One number answering: how much do two directions agree?

01 · First principles: The question the dot product answers

Take two vectors a and b. Without the dot product, we cannot answer the most basic geometric questions: do these two directions point the same way, even partly? How long is a? What angle sits between them? Every notion of length, angle, and similarity in linear algebra is built from this single operation.

The dot product compresses two vectors into one scalar that measures agreement: large and positive when they pull together, zero when they share nothing, negative when they oppose.

Working slogan: the dot product is "how much of b lies along a", scaled by how long a is.

02 · Two faces: Algebraic and geometric, the same number

There are two definitions, and the entire usefulness of the dot product is that they coincide.

a · b = Σ aᵢ bᵢ = ‖a‖ ‖b‖ cos θ

Middle expression: coordinates (cheap to compute). Right-hand expression: geometry (what it means).

Why are they the same thing? The coordinate sum is bilinear: it distributes over addition in each slot. Apply it to ‖a − b‖² = (a − b)·(a − b) and expand: ‖a‖² + ‖b‖² − 2 a·b. The law of cosines says the same squared distance equals ‖a‖² + ‖b‖² − 2‖a‖‖b‖cos θ. Equate the two and a·b = ‖a‖‖b‖cos θ falls out. The coordinate formula is the geometry, just paid for in arithmetic.
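The argument above can be checked numerically. The following is a minimal sketch in plain Python; the vectors (3, 1) and (1, 2) and the helper names are assumptions chosen for illustration. The angle is measured from each vector's polar angle, so the geometric side never calls the coordinate formula.

```python
import math

def dot(a, b):
    """Coordinate definition: sum of elementwise products."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length, computed with math.hypot rather than via dot."""
    return math.hypot(*a)

def angle_between_2d(a, b):
    """Angle from a to b, taken as the difference of their polar angles."""
    return math.atan2(b[1], b[0]) - math.atan2(a[1], a[0])

# Illustrative vectors (an assumption, not from the text above).
a = (3.0, 1.0)
b = (1.0, 2.0)

# Face 1: coordinates.
algebraic = dot(a, b)

# Face 2: geometry.
geometric = norm(a) * norm(b) * math.cos(angle_between_2d(a, b))

# Law-of-cosines route: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b,
# solved for a.b using only lengths.
diff = (a[0] - b[0], a[1] - b[1])
from_distances = (norm(a) ** 2 + norm(b) ** 2 - norm(diff) ** 2) / 2

# All three values agree up to floating-point rounding.
print(algebraic, geometric, from_distances)
```

The third value is the step in the derivation made concrete: once lengths are known, the dot product is forced.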

Two immediate corollaries: ‖a‖ = √(a·a), so length is a dot product with yourself; and cos θ = a·b / (‖a‖‖b‖), which is exactly cosine similarity.
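Both corollaries are one-liners in code. This is a small sketch with hypothetical helper names (`length`, `cosine_similarity`); the zero-vector check is an added assumption, since the angle is undefined when either vector has no direction.

```python
import math

def dot(a, b):
    """Sum of elementwise products."""
    return sum(x * y for x, y in zip(a, b))

def length(a):
    """||a|| = sqrt(a . a): length is a dot product with yourself."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """cos(theta) = a.b / (||a|| ||b||); undefined for a zero vector."""
    denom = length(a) * length(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / denom

print(length((3.0, 4.0)))                          # 5.0
print(cosine_similarity((1.0, 0.0), (0.0, 2.0)))   # 0.0 (orthogonal)
print(cosine_similarity((1.0, 2.0), (2.0, 4.0)))   # approximately 1.0 (parallel)
```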

03 · The picture: Projection is the working definition

Drop a perpendicular from the tip of b onto the line through a. The shadow that lands there is the projection of b onto a, and the dot product is its (signed) length times ‖a‖.

The dot product only sees the shadow. Everything in b perpendicular to a is invisible to a·b.

This is the definition worth carrying around: a·b asks "if I only cared about the direction of a, how much of b would I keep?" The perpendicular remainder b − proj_a(b) is the part a cannot see — the seed of orthogonality.
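A short sketch of the shadow and the remainder, using the standard formula proj_a(b) = (a·b / a·a) a. The example vectors and the function name `project` are assumptions for illustration.

```python
def dot(a, b):
    """Sum of elementwise products."""
    return sum(x * y for x, y in zip(a, b))

def project(b, a):
    """Projection of b onto the line through a: (a.b / a.a) * a."""
    aa = dot(a, a)
    if aa == 0.0:
        raise ValueError("cannot project onto the zero vector")
    scale = dot(a, b) / aa
    return tuple(scale * x for x in a)

# Illustrative vectors: a lies along the x-axis with length 2.
a = (2.0, 0.0)
b = (3.0, 4.0)

shadow = project(b, a)                                    # (3.0, 0.0)
remainder = tuple(bi - si for bi, si in zip(b, shadow))   # (0.0, 4.0)

print(dot(a, b))          # 6.0: signed shadow length 3 times ||a|| = 2
print(dot(a, shadow))     # 6.0: the shadow carries the whole dot product
print(dot(a, remainder))  # 0.0: the perpendicular part is invisible to a
```

Replacing b by its shadow leaves a·b unchanged, which is the sense in which the dot product only sees the shadow.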

04 · Reading the number: Sign and magnitude

Value of a·b	Geometry	Read as
Large positive	θ near 0°, long shadow	Strong agreement, similar directions
Zero	θ = 90°, shadow vanishes	No shared information — orthogonal
Negative	θ past 90°	Active disagreement, opposing directions

One caution: raw magnitude conflates angle with length. Two long vectors at 80° can out-score two short vectors at 5°. When only direction matters, normalise first (cosine similarity); when magnitude is meaningful (confidence, intensity), keep it.
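The caution is easy to reproduce. In this sketch the lengths (10 for the long pair, 1 for the short pair) are assumptions picked to make the effect visible; only the angles 80° and 5° come from the text.

```python
import math

def dot(a, b):
    """Sum of elementwise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product after dividing out both lengths."""
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def at_angle(length, theta_deg):
    """2D vector with the given length and polar angle in degrees."""
    t = math.radians(theta_deg)
    return (length * math.cos(t), length * math.sin(t))

# Two long vectors 80 degrees apart, two short vectors 5 degrees apart.
long_a, long_b = at_angle(10.0, 0.0), at_angle(10.0, 80.0)
short_a, short_b = at_angle(1.0, 0.0), at_angle(1.0, 5.0)

raw_long = dot(long_a, long_b)      # about 17.4 (100 * cos 80 deg)
raw_short = dot(short_a, short_b)   # about 0.996 (cos 5 deg)

cos_long = cosine_similarity(long_a, long_b)     # about 0.17
cos_short = cosine_similarity(short_a, short_b)  # about 0.996

print(raw_long > raw_short)   # True: length wins on the raw score
print(cos_long < cos_short)   # True: direction wins after normalising
```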

05 · Why ML cares: The atom of machine learning

Strip away the framework code and almost every primitive in ML is a dot product asking "how much do these agree?"

Every neuron. A unit computing wᵀx + b is a template matcher: w is the pattern it looks for, the dot product scores how much of that pattern is present in x, the bias sets the threshold. A layer is many such questions asked at once; a matrix–vector product is nothing but stacked dot products.
Attention logits. The score between a query and a key is qᵀk divided by √d, where d is the dimension of the query and key vectors. Attention scoring is a similarity search implemented with dot products, which is why it maps so well onto matrix-multiply hardware.
Embeddings and retrieval. "Related" items are placed so their vectors have a large dot product; nearest-neighbour search, contrastive losses, and recommendation scores all reduce to it.
Loss landscapes. The directional derivative of a loss along direction v is ∇L · v. Among unit-length directions, the one whose dot product with the gradient is most negative is −∇L/‖∇L‖, which is why gradient descent steps along −∇L.
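Three of the items above, written out as plain dot products. This is an illustrative sketch, not any framework's implementation; the weights, inputs, query, keys, and gradient are made-up values, and the function names are hypothetical.

```python
import math

def dot(a, b):
    """Sum of elementwise products."""
    return sum(x * y for x, y in zip(a, b))

def neuron(w, x, bias):
    """Pre-activation of one unit: w^T x + b."""
    return dot(w, x) + bias

def layer(W, x, biases):
    """Matrix-vector product done row by row: one dot product per unit."""
    return [neuron(w, x, b) for w, b in zip(W, biases)]

def attention_logits(q, keys):
    """Scores q^T k / sqrt(d) for one query against each key."""
    d = len(q)
    return [dot(q, k) / math.sqrt(d) for k in keys]

# A layer with two units looking for two different patterns in x.
W = [(1.0, 0.0, -1.0),
     (0.5, 0.5, 0.5)]
x = (2.0, 4.0, 6.0)
outputs = layer(W, x, biases=(0.0, -1.0))
print(outputs)   # [-4.0, 5.0]

# One query scored against an aligned, an orthogonal, and an opposed key.
q = (1.0, 0.0, 1.0, 0.0)
keys = [(1.0, 0.0, 1.0, 0.0),
        (0.0, 1.0, 0.0, 1.0),
        (-1.0, 0.0, -1.0, 0.0)]
logits = attention_logits(q, keys)
print(logits)    # [1.0, 0.0, -1.0]

# Directional derivative along the unit vector pointing against the gradient.
grad = (3.0, 4.0)                        # hypothetical gradient, norm 5
grad_norm = math.sqrt(dot(grad, grad))
steepest = tuple(-g / grad_norm for g in grad)
slope = dot(grad, steepest)
print(slope)     # approximately -5.0, i.e. -||grad||
```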

Scale note: the √d in attention exists because a sum of d independent products of zero-mean, unit-variance components has standard deviation growing like √d; without rescaling, logits saturate the softmax as the dimension grows.
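The √d growth can be observed with a small simulation. This sketch assumes query and key entries drawn independently from a standard normal distribution, which is a modelling assumption rather than a property of any trained model; the sample count and seed are arbitrary.

```python
import math
import random
import statistics

def dot(a, b):
    """Sum of elementwise products."""
    return sum(x * y for x, y in zip(a, b))

def logit_std(d, samples, rng):
    """Empirical std of q.k for random q, k with i.i.d. N(0, 1) entries."""
    values = []
    for _ in range(samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d)]
        values.append(dot(q, k))
    return statistics.pstdev(values)

rng = random.Random(0)  # fixed seed so repeated runs print the same numbers
for d in (16, 64, 256):
    raw = logit_std(d, samples=1000, rng=rng)
    # The raw spread tracks sqrt(d); dividing by sqrt(d) brings it back near 1.
    print(d, round(raw, 2), round(raw / math.sqrt(d), 2))
```

Under these assumptions the last column should stay close to 1 while the middle column grows with d, which is the behaviour the rescaling is meant to cancel.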

Mental Model

a·b is one scalar measuring agreement: positive together, zero orthogonal, negative opposed.
Σaᵢbᵢ and ‖a‖‖b‖cos θ are the same number — coordinates are how you compute, projection is what it means.
The dot product only sees the shadow of b along a; the perpendicular part is invisible to it.
Length, angle, similarity, neuron activations, attention scores: all are this one operation wearing different clothes.