Stop tagging tokens with coordinates; bias the scores by distance instead
Absolute schemes (learned tables, sinusoids) stamp a coordinate onto each token and trust the network to compute differences. But the quantity attention actually needs — "how far apart are the query and this key?" — is a property of the pair, and attention already has a place where every pair meets: the logit sij. So put position there.
Tokens carry no position at all. The model sees "this key is 7 back" directly, never "this key is at 1840 and I am at 1847". The family of methods differs only in what the function b is.
Train an absolute-position model on sequences of length 1024 and ask it to handle 2048. Positions 1025–2048 are entries the embedding table has no rows for, or sinusoid readings the network never learned to decode. Worse, even within the trained range the model has learned position-specific quirks (row 512 means whatever the data at position 512 tended to look like), which is memorisation of coordinates, not understanding of order.
A relative scheme cannot make that mistake by construction: offset 7 at position 100 and offset 7 at position 100,000 are literally the same input to b. The statistics of small offsets are abundant at every training length, so the function b is well-trained wherever it is evaluated. Length generalisation stops being an extrapolation problem and becomes an interpolation one.
| Method | b(i − j) is… | Parameters | Beyond trained length |
|---|---|---|---|
| Shaw et al. 2018 | learned embedding per offset, clipped at ±k; (originally added to keys, later simplified to a logit bias) | 2k+1 vectors per layer | clipping helps; offsets past k all map to the edge bucket |
| T5 buckets | learned scalar per bucket: exact buckets for small offsets, log-spaced buckets for large ones, shared across layers | ~32 scalars per head | decent; far offsets share the "very far" bucket |
| ALiBi | not learned at all: −m·|i − j|, a linear penalty with a fixed per-head slope m (slopes a geometric sequence across heads) | zero | best in class; the penalty extends to any distance by definition |
The progression is a steady removal of machinery. Shaw learns a vector per offset; T5 compresses to a scalar per bucket; ALiBi deletes the learning entirely and hard-codes "nearer is more relevant", letting different heads decay at different rates so some stay long-sighted. That such a blunt prior works tells you most of what position information does in practice: it implements recency.
Two shapes of b. T5 learns a coarse staircase over offsets; ALiBi fixes a straight penalty whose slope varies by head.
A scalar bias added to the logit is position information of the cheapest kind: it can make distance globally attractive or repulsive, but it cannot make the content match itself depend on position. "Attend to the token 3 back if it is a verb" requires position to interact with the query and key vectors, which a fixed additive term cannot do (Shaw's vector form can, at more cost). RoPE sits at exactly this gap — it puts the offset inside the q·k product, where it multiplies content rather than being added beside it, which is a large part of why it won.