LLMs

Sinusoidal Embeddings

A multi-frequency clock that turns position into geometry

01 · First principles
Attention cannot see order

Attention is a function of sets. Permute the input tokens and the outputs permute identically: nothing in softmax(QKᵀ)V depends on where a token sits, only on what it is. "Dog bites man" and "man bites dog" produce the same bag of attention outputs. This is called permutation equivariance, and for language it is a defect: order carries meaning, so order must be injected, because the architecture will never recover it on its own.
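Permutation equivariance is easy to check numerically. The sketch below is a minimal single-head self-attention in plain Python, with random weights and three random vectors standing in for token embeddings; the helper names (`attention`, `matmul`, `rand_matrix`) and the sizes are illustrative assumptions, not taken from any library.

```python
import math
import random

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # Plain nested-list matrix product: rows of a times columns of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(x, wq, wk, wv):
    """Single-head self-attention with no positional information added."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)

rng = random.Random(0)
d = 4  # embedding width, chosen arbitrarily for the demo

def rand_matrix(rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

wq, wk, wv = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)
tokens = rand_matrix(3, d)   # stand-ins for "dog", "bites", "man"
perm = [2, 1, 0]             # reorder to "man", "bites", "dog"

out = attention(tokens, wq, wk, wv)
out_perm = attention([tokens[i] for i in perm], wq, wk, wv)

# Row i of the permuted run should match row perm[i] of the original run,
# up to floating-point rounding.
max_diff = max(
    abs(a - b)
    for i, p in enumerate(perm)
    for a, b in zip(out_perm[i], out[p])
)
print(max_diff)
```

The printed difference should sit at rounding-error scale: each token receives the same output vector regardless of where it appears in the sequence.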

The cheapest injection is to add a position-dependent vector pₜ to each token embedding. The whole design question is: what should pₜ be?

02 · Naive attempts
Two obvious encodings, two failures

pₜ = t (a counter)
Unbounded: position 5000 produces a vector component 5000× larger than position 1, drowning the token embedding it is added to. Scale depends on sequence length; nothing about it is stable.
pₜ = t / N (normalised counter)
Bounded, but the meaning of a step now depends on N: positions 3 and 4 are 0.1 apart in a 10-token sequence and 0.001 apart in a 1000-token one. "Next word" should mean the same thing everywhere.
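The arithmetic behind both failures can be reproduced in a few lines. This is a small sketch; the function names `counter` and `normalised_counter` are hypothetical labels for the two encodings described above.

```python
def counter(t):
    # p_t = t: unbounded, grows with position.
    return float(t)

def normalised_counter(t, n):
    # p_t = t / N: bounded in [0, 1], but the step size depends on N.
    return t / n

# Failure 1: the magnitude at position 5000 is 5000 times that at position 1.
magnitude_ratio = counter(5000) / counter(1)

# Failure 2: the gap between positions 3 and 4 changes with sequence length.
step_short = normalised_counter(4, 10) - normalised_counter(3, 10)      # about 0.1
step_long = normalised_counter(4, 1000) - normalised_counter(3, 1000)   # about 0.001

print(magnitude_ratio, step_short, step_long)
```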

What we actually want is a code where every position is distinct, every component is bounded, and — crucially — the relationship between positions t and t+k looks the same for every t. Adjacency should be a fixed geometric fact.

03 · The fix
Sinusoids at geometrically spaced frequencies

The Transformer paper's encoding fills the d-dimensional position vector with sine/cosine pairs, each pair ticking at its own frequency:

pₜ[2i] = sin(t · ωᵢ)    pₜ[2i+1] = cos(t · ωᵢ),    ωᵢ = 10000^(−2i/d)
frequencies from 1 down to just above 1/10000, geometrically spaced

The right analogy is binary counting. Write integers in binary and the lowest bit flips every step, the next every two steps, the next every four: fast and slow dials together identify the number uniquely. Sinusoids are the smooth version of the same idea: dimension pair 0 spins quickly (distinguishing neighbours), the last pair drifts over tens of thousands of positions (encoding coarse location), and the full vector is a clock with d/2 hands. Bounded in [−1, 1], distinct for every position, no length normalisation anywhere.
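A direct transcription of the formula makes the clock picture concrete. The sketch below uses only the standard library; the function name `sinusoidal_encoding`, the width d = 16 and the 512-position table are assumptions chosen for illustration.

```python
import math

def sinusoidal_encoding(t, d, base=10000.0):
    """Return the d-dimensional position vector for position t (d must be even)."""
    p = []
    for i in range(d // 2):
        omega = base ** (-2 * i / d)     # frequency of pair i
        p.append(math.sin(t * omega))    # component 2i
        p.append(math.cos(t * omega))    # component 2i + 1
    return p

d = 16
table = [sinusoidal_encoding(t, d) for t in range(512)]

# Every component stays inside [-1, 1], whatever the position.
bounded = all(-1.0 <= x <= 1.0 for row in table for x in row)

# Every position in the table gets its own vector (compared after rounding).
distinct = len({tuple(round(x, 9) for x in row) for row in table}) == len(table)

# Wavelength of each pair in positions: 2*pi for the fastest hand,
# growing geometrically towards 2*pi*10000 for the slowest.
wavelengths = [2 * math.pi * 10000 ** (2 * i / d) for i in range(d // 2)]

print(bounded, distinct)
print([round(w, 1) for w in wavelengths])
```

Position 0 encodes as alternating 0 and 1 (sin 0 and cos 0), and no sequence length appears anywhere in the computation.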

[Figure: sinusoid channels i=0, i=4 and i=8 plotted against position t. Fast dimensions resolve neighbours; slow ones encode region.]

Three of the d/2 frequency channels. Reading all hands at once identifies t uniquely, like reading a clock's second, minute and hour hands.

04 · The property that matters
An offset is a linear map

Here is the reason this encoding was chosen over any other bounded injective code. The angle-addition identities say:

sin((t+k)ω) = sin(tω)cos(kω) + cos(tω)sin(kω)
cos((t+k)ω) = cos(tω)cos(kω) − sin(tω)sin(kω)
coefficients depend on k only — not on t

In matrix form: pₜ₊ₖ = Rₖ pₜ, where Rₖ is a block-diagonal rotation whose angles depend only on the offset k. "Shift by k" is one fixed linear transformation, the same everywhere in the sequence. An attention head's linear projections can therefore learn to ask "what was 3 tokens back?" as a single matrix: relative position becomes learnable by linear machinery, which is exactly what the model has. This rotation property is the seed that RoPE later promotes from "available if learned" to "enforced by construction".
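The shift property can be verified numerically. The sketch below applies the 2×2 blocks of the rotation directly rather than building the full matrix; the names `encode` and `shift`, and the choice d = 8, k = 3, are illustrative assumptions.

```python
import math

def omegas(d, base=10000.0):
    return [base ** (-2 * i / d) for i in range(d // 2)]

def encode(t, d):
    p = []
    for w in omegas(d):
        p += [math.sin(t * w), math.cos(t * w)]
    return p

def shift(p, k, d):
    """Apply the block-diagonal shift by k: one 2x2 block per frequency.

    The coefficients cos(k*w) and sin(k*w) are built from k alone;
    the position t never enters.
    """
    out = []
    for i, w in enumerate(omegas(d)):
        s, c = p[2 * i], p[2 * i + 1]
        ck, sk = math.cos(k * w), math.sin(k * w)
        out.append(s * ck + c * sk)   # sin((t+k)w) by angle addition
        out.append(c * ck - s * sk)   # cos((t+k)w) by angle addition
    return out

d, k = 8, 3
# Largest componentwise gap between the shifted p_t and p_{t+k} computed
# directly, at several unrelated starting positions.
errors = [
    max(abs(a - b) for a, b in zip(shift(encode(t, d), k, d), encode(t + k, d)))
    for t in (0, 7, 250, 4000)
]
print(errors)
```

The same `shift` call, with the same k, maps each starting position to the position three steps later, and the gaps should all be at rounding-error scale.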

05 · Limits
What it does not buy

Status today: rarely used in new LLMs, but the canonical baseline, and the cleanest place to learn why all positional schemes are really about making relative offsets easy to compute.
Mental Model