RoPE

Rotate queries and keys; relative position falls out of the dot product

01 · First principles · What attention should actually see

For most of language, what matters is not "this token is at position 1847" but "this token is 3 before that one". The attention score between a query and a key should therefore depend on their offset, not their absolute coordinates. Sinusoidal embeddings make offsets linearly accessible but still hand the model absolute positions and hope it learns to subtract. Relative bias schemes get the right dependence but bolt an extra term onto the logits.

RoPE asks a sharper question: can we encode position so that the dot product — the thing attention already computes — depends on relative offset by construction?

02 · The mechanism · Position as rotation

Split the d-dimensional query into d/2 two-dimensional pairs. For a query at position m, rotate each pair by an angle proportional to m; do the same to keys at their positions. Pair i (for i = 0, …, d/2 − 1) has its own base frequency θ_i = 10000^(−2i/d), the same geometric ladder as the sinusoidal scheme:

q_m = R(mθ) q, k_n = R(nθ) k

R(α) = 2×2 rotation by angle α, applied per pair

No vector is added to the embedding; position never enters the residual stream or the values. It exists only in the instant the query meets the key.

Each 2D pair spins on its own clock face. The dot product measures the angle between the two hands, and that angle depends only on m − n.
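
The rotation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `rope_frequencies` and `rope_rotate` are made up for this example, and pairing adjacent dimensions (0 with 1, 2 with 3, and so on) is an assumption, since some implementations pair dimension i with dimension i + d/2 instead.

```python
import math

def rope_frequencies(d, base=10000.0):
    """Per-pair frequencies theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def rope_rotate(x, pos, base=10000.0):
    """Rotate each adjacent pair (x[2i], x[2i+1]) by the angle pos * theta_i."""
    d = len(x)
    assert d % 2 == 0, "dimension must be even so it splits into 2D pairs"
    out = []
    for i, theta in enumerate(rope_frequencies(d, base)):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        # Standard 2x2 rotation applied to the pair (a, b).
        out.extend([a * c - b * s, a * s + b * c])
    return out

q = [1.0, 0.0, 0.5, -0.5]      # toy query with d = 4, i.e. two pairs
q_at_0 = rope_rotate(q, pos=0)  # position 0 leaves the vector unchanged
q_at_7 = rope_rotate(q, pos=7)  # same vector, rotated for position 7

norm_before = math.sqrt(sum(v * v for v in q))
norm_after = math.sqrt(sum(v * v for v in q_at_7))
```

Because each pair is only rotated, `norm_after` matches `norm_before` up to rounding: position changes the direction of each 2D slice, never its length.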

03 · The two-line argument · Why only the offset survives

Rotations are orthogonal, so R(α)^T = R(−α), and consecutive rotations add their angles. Take the score between a rotated query and rotated key:

q_m · k_n = (R(mθ)q)^T(R(nθ)k) = q^T R(−mθ) R(nθ) k = q^T R((n−m)θ) k

m and n appear only as their difference

That is the entire proof. The mechanism is absolute — each token is rotated by its own position, independently, with no knowledge of any other token — yet the behaviour is relative, because the dot product cancels the common part. Shift the whole sequence by 1000 positions and every attention score is unchanged.
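
The shift invariance is easy to check numerically. The sketch below repeats a hypothetical `rope_rotate` helper (adjacent-pair layout, an assumption of this example) so that it runs on its own, then compares scores for the same offset at two very different absolute positions.

```python
import math
import random

def rope_rotate(x, pos, base=10000.0):
    """Rotate each adjacent pair of x by pos * base^(-2i/d)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out.extend([a * c - b * s, a * s + b * c])
    return out

def score(q, k, m, n):
    """Dot product of the query rotated to position m and the key rotated to n."""
    q_m = rope_rotate(q, m)
    k_n = rope_rotate(k, n)
    return sum(a * b for a, b in zip(q_m, k_n))

random.seed(0)
d = 16
q = [random.gauss(0.0, 1.0) for _ in range(d)]
k = [random.gauss(0.0, 1.0) for _ in range(d)]

s_base = score(q, k, m=5, n=2)           # offset n - m = -3
s_shifted = score(q, k, m=1005, n=1002)  # same offset, shifted by 1000
s_other = score(q, k, m=5, n=3)          # different offset

print(abs(s_base - s_shifted) < 1e-9)    # same offset, same score up to rounding
```

Changing the offset, as in `s_other`, generally changes the score; shifting both positions together does not.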

Absolute mechanism, relative behaviour. This is why RoPE adds almost nothing to the cost of inference: each key is rotated once, at its own position, before it is cached, and the relative arithmetic happens for free inside QK^T. No extra bias matrix, no lookup per query-key pair.
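
A toy cache makes the point concrete. The class name `RotatedKeyCache` and its methods are hypothetical, invented for this sketch, and the adjacent-pair rotation layout is again an assumption. Keys are rotated once on append and never touched again; only the incoming query is rotated at decode time.

```python
import math

def rope_rotate(x, pos, base=10000.0):
    """Rotate each adjacent pair of x by pos * base^(-2i/d)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out.extend([a * c - b * s, a * s + b * c])
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

class RotatedKeyCache:
    """Hypothetical KV-cache sketch: stores keys already rotated to their positions."""

    def __init__(self):
        self.keys = []

    def append(self, k):
        # The new key's position is its index in the cache; rotate once and store.
        self.keys.append(rope_rotate(k, len(self.keys)))

    def scores(self, q):
        # The query belongs to the newest token, so it sits at the last position.
        q_m = rope_rotate(q, len(self.keys) - 1)
        return [dot(q_m, k_n) for k_n in self.keys]

raw_keys = [[1.0, 0.0, 0.0, 1.0], [0.5, 0.5, -1.0, 0.0], [0.0, 2.0, 1.0, 1.0]]
cache = RotatedKeyCache()
for k in raw_keys:
    cache.append(k)

q = [0.3, -0.7, 1.0, 0.2]
cached_scores = cache.scores(q)

# Reference: apply only the relative rotation R((n - m) * theta) to each raw key.
m = len(raw_keys) - 1
relative_scores = [dot(q, rope_rotate(k, n - m)) for n, k in enumerate(raw_keys)]
```

`cached_scores` and `relative_scores` agree up to rounding, even though the cache never looks at an offset explicitly.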

04 · The fine print · Frequencies, decay, and the long-context catch

The geometric frequency ladder matters. High-frequency pairs make scores oscillate quickly with offset, resolving adjacent tokens; low-frequency pairs vary slowly, carrying long-range structure. Summed across pairs, the interference produces a useful side effect: for a query and key that would align at zero offset, the score tends to decay as the offset grows, a mild built-in locality prior.
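
The decay can be seen in a stripped-down setting. Assume, for illustration only, that the query equals the key and has unit energy in every pair; the score at offset Δ then reduces to the sum of cos(Δ·θ_i) over pairs. The function names and the choice d = 64 are assumptions of this sketch.

```python
import math

def offset_score(delta, d=64, base=10000.0):
    """Score for q = k with unit energy per pair: sum over pairs of cos(delta * theta_i)."""
    return sum(math.cos(delta * base ** (-2.0 * i / d)) for i in range(d // 2))

def mean_score(lo, hi):
    """Average score over offsets lo .. hi - 1, to smooth out the fast oscillations."""
    return sum(offset_score(delta) for delta in range(lo, hi)) / (hi - lo)

peak = offset_score(0)      # all pairs aligned: exactly d / 2
near = mean_score(1, 17)    # offsets 1 .. 16
far = mean_score(200, 216)  # offsets 200 .. 215

print(peak, near > far)
```

Individual offsets still wiggle because of the fast pairs, which is why the comparison uses window averages rather than single values. The trend is a tendency, not a monotone guarantee.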

The catch is the one every positional scheme faces: positions beyond the training length put the slow channels at angles the model has never seen, and quality degrades sharply. Because RoPE's positions are continuous angles, however, there is a cheap remedy unavailable to lookup-table schemes: rescale the frequencies so that the slow channels at the new length stay within the range of angles seen in training. NTK-aware scaling does this by enlarging the base, and YaRN rescales frequencies unevenly, sparing the high-frequency pairs that encode local order. Both extend context windows several-fold with little or no finetuning.
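
As a rough sketch of the base-rescaling idea, the commonly cited NTK-aware rule multiplies the base by s^(d/(d−2)) for a context extension factor s. That formula is an assumption brought in for this example rather than something stated above, the function names are hypothetical, and YaRN's uneven per-frequency treatment is not modelled here.

```python
import math

def rope_frequencies(d, base=10000.0):
    """Per-pair frequencies theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def ntk_scaled_base(base, scale, d):
    """Assumed NTK-aware rule: enlarge the base so the slowest pair slows by 1/scale."""
    return base * scale ** (d / (d - 2))

d = 64
scale = 4.0            # hypothetical extension factor: 4x the trained length
trained_length = 2048  # hypothetical trained context length

original = rope_frequencies(d)
scaled = rope_frequencies(d, ntk_scaled_base(10000.0, scale, d))

# Slowest pair: angle at the extended length matches the angle at the trained length.
angle_trained = trained_length * original[-1]
angle_extended = scale * trained_length * scaled[-1]

print(math.isclose(angle_trained, angle_extended, rel_tol=1e-9))
```

Under this rule the fastest pair keeps its frequency, so adjacent tokens stay distinguishable, while the slowest pair is stretched by exactly the extension factor; the pairs in between are scaled by intermediate amounts.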

Scheme	Where position lives	Score depends on	Extends past trained length?
Learned absolute	added to embedding	m and n separately	no (table runs out)
Sinusoidal	added to embedding	m and n (offsets learnable)	poorly
RoPE	rotation of q, k	m − n by construction	yes, with base rescaling (NTK/YaRN)

Mental Model

Rotate each 2D slice of q and k by (position × per-pair frequency); add nothing to the embeddings.
R(−mθ)R(nθ) = R((n−m)θ): the dot product cancels absolute position and keeps only the offset. Two lines, whole idea.
Absolute mechanism, relative behaviour, so the KV cache works unchanged and inference pays almost nothing extra.
Multi-frequency ladder: fast pairs resolve neighbours, slow pairs carry range, interference gives mild distance decay.
Context extension = frequency rescaling (NTK/YaRN), possible only because position is a continuous angle, not a table entry.