Rotate queries and keys; relative position falls out of the dot product
For most of language, what matters is not "this token is at position 1847" but "this token is 3 before that one". The attention score between a query and a key should therefore depend on their offset, not their absolute coordinates. Sinusoidal embeddings make offsets linearly accessible but still hand the model absolute positions and hope it learns to subtract. Relative bias schemes get the right dependence but bolt an extra term onto the logits.
RoPE asks a sharper question: can we encode position so that the dot product — the thing attention already computes — depends on relative offset by construction?
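Split the d-dimensional query into d/2 two-dimensional pairs. For a query at position m, rotate each pair by an angle proportional to m; do the same to keys at their positions. Each pair i (counting from 0 to d/2 − 1) has its own base frequency $\theta_i = 10000^{-2i/d}$, the same geometric ladder as the sinusoidal scheme:

$$
R(\alpha) = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix},
\qquad
\begin{pmatrix} q'_{2i} \\ q'_{2i+1} \end{pmatrix} = R(m\theta_i) \begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}
$$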
No vector is added to the embedding; position never enters the residual stream or the values. It exists only in the instant the query meets the key.
Each 2D pair spins on its own clock face. The dot product measures the angle between the two hands, and that angle depends only on m − n.
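The rotation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `rope`, the `base` parameter, and the choice to pair adjacent dimensions (2i, 2i+1) are assumptions made here. Some implementations pair dimension i with i + d/2 instead, which is equivalent up to a fixed permutation of the dimensions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair (vec[2i], vec[2i+1]) by the angle pos * theta_i.

    Hypothetical helper for illustration; theta_i = base ** (-2i / d).
    """
    d = len(vec)
    assert d % 2 == 0, "dimension must be even"
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)      # per-pair base frequency
        angle = pos * theta                 # angle grows linearly with position
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # 2D rotation of the pair
    return out

q = [1.0, 0.0, 1.0, 0.0]
q_rot = rope(q, pos=5)
print(q_rot)
```

Because every pair is only rotated, the vector's norm is unchanged, and nothing is added to the embedding itself.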
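Rotations are orthogonal, so $R(\alpha)^\top = R(-\alpha)$, and consecutive rotations add their angles. Take the score between a rotated query and rotated key, written for a single pair i (q and k denote that pair's two-dimensional slices; the full score sums this over all pairs):

$$
\big(R(m\theta_i)\,q\big)^\top \big(R(n\theta_i)\,k\big)
= q^\top R(m\theta_i)^\top R(n\theta_i)\,k
= q^\top R(-m\theta_i)\,R(n\theta_i)\,k
= q^\top R\big((n-m)\,\theta_i\big)\,k
$$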
That is the entire proof. The mechanism is absolute — each token is rotated by its own position, independently, with no knowledge of any other token — yet the behaviour is relative, because the dot product cancels the common part. Shift the whole sequence by 1000 positions and every attention score is unchanged.
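The shift-invariance claim is easy to check numerically. The sketch below is an illustration under assumed names (`rope`, `score`) with arbitrary random vectors; it compares the score at positions (7, 4) with the score at (1007, 1004), which share the same offset.

```python
import math
import random

def rope(vec, pos, base=10000.0):
    """Hypothetical helper: rotate adjacent pairs by pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        angle = pos * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def score(q, k, m, n):
    """Dot product of q rotated to position m and k rotated to position n."""
    return sum(a * b for a, b in zip(rope(q, m), rope(k, n)))

random.seed(0)
d = 8
q = [random.gauss(0, 1) for _ in range(d)]
k = [random.gauss(0, 1) for _ in range(d)]

s_near = score(q, k, m=7, n=4)
s_far = score(q, k, m=1007, n=1004)   # same offset, everything shifted by 1000
print(s_near, s_far)                  # equal up to floating-point rounding
```

Each token was rotated using only its own position, yet the two scores agree because only the difference m − n survives the dot product.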
The geometric frequency ladder matters. High-frequency pairs make scores oscillate quickly with offset, resolving adjacent tokens; low-frequency pairs vary slowly, carrying long-range structure. Summed across pairs, the interference produces a useful side effect: the score between a query and a key with similar content tends to shrink, with some oscillation, as their distance grows, a mild built-in locality prior.
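The decay with distance can be made visible with a small experiment. The sketch below is only an illustration: it fixes identical all-ones content for the query and the key (an assumption chosen so the trend is easy to see, not something the text prescribes), uses d = 64, and prints the score at a few offsets. The names `rope` and `score_at_offset` are hypothetical.

```python
import math

def rope(vec, pos, base=10000.0):
    """Hypothetical helper: rotate adjacent pairs by pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        angle = pos * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def score_at_offset(offset, d=64):
    """Score of an all-ones query at position `offset` against an all-ones key at 0."""
    q = [1.0] * d
    k = [1.0] * d
    return sum(a * b for a, b in zip(rope(q, offset), rope(k, 0)))

for offset in (0, 1, 10, 100, 1000):
    print(offset, round(score_at_offset(offset), 2))

# Averaging over a window of distant offsets smooths out the oscillation.
peak = score_at_offset(0)
far_mean = sum(score_at_offset(o) for o in range(1000, 2000)) / 1000
print(peak, round(far_mean, 2))
```

At offset 0 every pair contributes its full weight. At large offsets the fast pairs have wrapped around many times and largely cancel, so only the slow pairs still add constructively.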
The catch is the familiar one: positions beyond the training length put the slow channels at angles the model has never seen, and quality collapses. Because RoPE's positions are continuous angles, however, there is a cheap remedy unavailable to lookup-table schemes: rescale the frequencies (for example by enlarging the base 10000) so that the angles reached at the new length stay within the range seen in training. NTK-aware scaling and YaRN both work this way (YaRN rescales frequencies unevenly, sparing the high-frequency pairs that encode local order), and they extend context windows several-fold with little or no finetuning.
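As a rough sketch of the base-rescaling idea, the snippet below uses the commonly cited NTK-aware rule, which multiplies the base by scale^(d/(d−2)). The function names, the trained length of 2048, and the scale factor of 4 are assumptions for illustration; YaRN's per-frequency treatment is more involved and is not shown.

```python
def frequencies(d, base=10000.0):
    """Per-pair frequencies theta_i = base ** (-2i / d), for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def ntk_scaled_base(d, scale, base=10000.0):
    """NTK-aware rule (as commonly stated): enlarge the base so that the
    slowest pair is slowed by exactly `scale`. Hypothetical helper name."""
    return base * scale ** (d / (d - 2))

d = 64
trained_len, scale = 2048, 4.0          # illustrative values, not from the text
new_len = int(trained_len * scale)

orig = frequencies(d)
ext = frequencies(d, ntk_scaled_base(d, scale))

# Fastest pair is untouched, so local order is encoded as before.
print(orig[0], ext[0])
# Slowest pair at the new length reaches the same angle it reached in training.
print(trained_len * orig[-1], new_len * ext[-1])
```

The pairs in between are slowed by intermediate amounts, so the stretch is concentrated in the low-frequency channels where unseen angles would otherwise appear.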
| Scheme | Where position lives | Score depends on | Extends past trained length? |
|---|---|---|---|
| Learned absolute | added to embedding | m and n separately | no (table runs out) |
| Sinusoidal | added to embedding | m and n (offsets learnable) | poorly |
| RoPE | rotation of q, k | m − n by construction | yes, with base rescaling (NTK/YaRN) |