
Relative Positional Embeddings

Stop tagging tokens with coordinates; bias the scores by distance instead

01 · First principles
Position belongs to pairs, not tokens

Absolute schemes (learned tables, sinusoids) stamp a coordinate onto each token and trust the network to compute differences. But the quantity attention actually needs, "how far apart are the query and this key?", is a property of the pair, and attention already has a place where every pair meets: the logit $s_{ij}$. So put position there.

$$
s_{ij} = \underbrace{\frac{q_i \cdot k_j}{\sqrt{d}}}_{\text{content score}} + \underbrace{b(i - j)}_{\text{positional bias, a function of offset only}}
$$

Tokens carry no position at all. The model sees "this key is 7 back" directly, never "this key is at 1840 and I am at 1847". The family of methods differs only in what the function b is.
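
The formula can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's implementation: the function name `relative_bias_logits`, the callable interface for `b`, and the toy bias are all assumptions made for the example, and causal masking is left out.

```python
import numpy as np

def relative_bias_logits(q, k, b):
    """Attention logits s_ij = q_i . k_j / sqrt(d) + b(i - j).

    q, k: arrays of shape (n, d).
    b: any callable mapping an integer offset array to a float array of the
       same shape (a hypothetical interface chosen for this sketch).
    """
    n, d = q.shape
    content = q @ k.T / np.sqrt(d)          # (n, n) content scores
    idx = np.arange(n)
    offsets = idx[:, None] - idx[None, :]   # offsets[i, j] = i - j
    return content + b(offsets)

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 4))
k = rng.normal(size=(6, 4))

# Toy bias for illustration only: a mild penalty that grows with distance.
toy_b = lambda off: -0.1 * np.abs(off)

logits = relative_bias_logits(q, k, toy_b)

# Recover the positional part: it is constant along every diagonal,
# because it depends on i - j and nothing else.
bias = logits - q @ k.T / np.sqrt(q.shape[1])
```

Swapping `toy_b` for a different function is the only change needed to move between the methods discussed in this article; the token vectors themselves never see a position.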

02 · Why absolute breaks
The generalisation failure

Train an absolute-position model on sequences of length 1024 and ask it to handle 2048. With a learned table, positions 1025–2048 have no rows at all; with sinusoids, they produce readings the network never learned to decode. Worse, even within the trained range the model has learned position-specific quirks (row 512 means whatever the data at position 512 tended to look like), which is memorisation of coordinates, not understanding of order.

A relative scheme cannot make that mistake, by construction: offset 7 at position 100 and offset 7 at position 100,000 are literally the same input to b. Small offsets are abundant at every training length, so b is well trained on them wherever in the sequence it is evaluated. Only offsets larger than any seen in training remain unseen, and each method below handles those by clipping, bucketing, or a fixed formula. Length generalisation largely stops being an extrapolation problem and becomes an interpolation one.
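
A small sketch makes the contrast concrete. The tables here are random stand-ins for learned parameters, positions are 0-indexed, and the helper names `absolute_lookup` and `relative_lookup` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
train_len = 1024

# Absolute scheme: one "learned" row per position seen in training.
abs_table = rng.normal(size=(train_len, 8))

def absolute_lookup(pos):
    # Any position >= train_len has no row to read.
    return abs_table[pos]

# Relative scheme: one "learned" scalar per offset in
# [-(train_len - 1), train_len - 1].
rel_table = rng.normal(size=2 * train_len - 1)

def relative_lookup(i, j):
    return rel_table[(i - j) + train_len - 1]

# Offset 7 is the same table entry no matter where the pair sits.
same = relative_lookup(100, 93) == relative_lookup(100_000, 99_993)

# The absolute table simply has nothing at position 1500.
try:
    absolute_lookup(1500)
    absolute_failed = False
except IndexError:
    absolute_failed = True
```

Note that `relative_lookup` in this naive form would still run out of entries for offsets of `train_len` or more; that is the case clipping, bucketing, or a closed-form b has to cover.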

03 · Three instantiations
Shaw, T5, ALiBi

| Method | b(i − j) is… | Parameters | Beyond trained length |
| --- | --- | --- | --- |
| Shaw et al. 2018 | learned embedding per offset, clipped at ±k (originally added to keys, later simplified to a logit bias) | 2k+1 vectors per layer | clipping helps; offsets past k all map to the edge bucket |
| T5 buckets | learned scalar per bucket: exact buckets for small offsets, log-spaced buckets for large ones, shared across layers | ~32 scalars per head | decent; far offsets share the "very far" bucket |
| ALiBi | not learned at all: $-m \cdot \lvert i - j \rvert$, a linear penalty with a fixed per-head slope m (slopes a geometric sequence across heads) | zero | best in class; the penalty extends to any distance by definition |
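
As a rough sketch of the first row, here is the simplified scalar-bias form of clipped offsets. It is not Shaw et al.'s original formulation (which adds learned vectors to the keys); the function name `clipped_offset_bias` is hypothetical and the table is random rather than learned.

```python
import numpy as np

def clipped_offset_bias(n, table, k):
    """Scalar bias per offset, with offsets clipped to [-k, k].

    table: array of 2k+1 scalars (random here, learned in a real model).
    Offsets beyond +-k reuse the edge entries of the table.
    """
    idx = np.arange(n)
    offsets = idx[:, None] - idx[None, :]   # i - j
    clipped = np.clip(offsets, -k, k)
    return table[clipped + k]               # shift so index 0 is offset -k

k = 4
table = np.random.default_rng(0).normal(size=2 * k + 1)
bias = clipped_offset_bias(12, table, k)
```

Because every offset past k lands on the same edge entry, the lookup is defined for any sequence length, which is why clipping helps beyond the trained range.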

The progression is a steady removal of machinery. Shaw learns a vector per offset; T5 compresses to a scalar per bucket; ALiBi deletes the learning entirely and hard-codes "nearer is more relevant", letting different heads decay at different rates so some stay long-sighted. That such a blunt prior works tells you most of what position information does in practice: it implements recency.
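
ALiBi's bias is simple enough to write out in full. The sketch below assumes the head count is a power of two and uses the geometric slope sequence 2^(-8/H), 2^(-16/H), ..., 2^(-8); the function names are illustrative, and the symmetric |i − j| form follows the table above rather than any specific causal implementation.

```python
import numpy as np

def alibi_slopes(num_heads):
    # Geometric sequence of per-head slopes; assumes num_heads is a power of two.
    return 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)

def alibi_bias(n, num_heads):
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])               # |i - j|
    # Broadcast slopes over the (n, n) distance grid: result is (heads, n, n).
    return -alibi_slopes(num_heads)[:, None, None] * dist

slopes = alibi_slopes(8)   # 1/2, 1/4, ..., 1/256
bias = alibi_bias(5, 8)
```

The first head (slope 1/2) penalises distance steeply and is effectively short-sighted; the last (slope 1/256) barely penalises it and can still attend far back.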

[Figure: bias b plotted against distance |i − j|, starting at offset 0. T5: learned steps over log-spaced buckets. ALiBi: −m|i − j|, two heads shown.]

Two shapes of b. T5 learns a coarse staircase over offsets; ALiBi fixes a straight penalty whose slope varies by head.
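
The staircase comes from the bucketing function. The sketch below is a simplified, causal-only version written in the spirit of T5's scheme, not a copy of it; the name `distance_bucket` and the values `num_buckets=32` and `max_distance=128` are assumptions for illustration.

```python
import numpy as np

def distance_bucket(distance, num_buckets=32, max_distance=128):
    """Map non-negative distances (i - j >= 0) to bucket ids.

    The first half of the buckets are exact (one per distance); the second
    half are log-spaced up to max_distance, after which everything shares
    the last bucket.
    """
    distance = np.asarray(distance)
    max_exact = num_buckets // 2
    is_small = distance < max_exact
    # Guard against log(0); those entries take the exact branch anyway.
    safe = np.maximum(distance, 1).astype(np.float64)
    log_pos = np.log(safe / max_exact) / np.log(max_distance / max_exact)
    large = max_exact + (log_pos * (num_buckets - max_exact)).astype(np.int64)
    large = np.minimum(large, num_buckets - 1)
    return np.where(is_small, distance, large)

buckets = distance_bucket(np.arange(300))

# One scalar per bucket (random stand-in for learned values) gives b.
bias_per_bucket = np.random.default_rng(0).normal(size=32)
b_of_distance = bias_per_bucket[buckets]
```

Distances 0 to 15 each get their own step, the steps then widen logarithmically, and every distance from 128 onward reads the same final entry.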

04 · The tradeoff
What the bias cannot express

A scalar bias added to the logit is position information of the cheapest kind: it can make distance globally attractive or repulsive, but it cannot make the content match itself depend on position. "Attend to the token 3 back if it is a verb" requires position to interact with the query and key vectors, which a content-independent additive term cannot do (Shaw's vector form can, at more cost). RoPE sits at exactly this gap: it puts the offset inside the q·k product, where it multiplies content rather than being added beside it, which is a large part of why it won.
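
To see what "inside the q·k product" means, here is a minimal single-vector sketch of rotary embeddings. The pairing of adjacent dimensions and the base of 10000 are common conventions assumed for this example, and `rope` is an illustrative name rather than a library function.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of a 1-D vector x by angles proportional to pos."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same offset (7) at very different absolute positions: same score.
near = rope(q, 7) @ rope(k, 0)
far = rope(q, 1847) @ rope(k, 1840)

# A different offset changes the score through the content vectors themselves.
other = rope(q, 3) @ rope(k, 0)
```

The score depends on the offset only, as with an additive bias, but the offset now rotates the vectors before they are multiplied, so how much it matters depends on what q and k contain.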

Engineering note: an explicit b(i−j) is an N×N term that must be materialised or fused into the attention kernel; ALiBi's regular structure fuses trivially, learned buckets need a gather. This kernel-friendliness, not just accuracy, decided which schemes survived the Flash-Attention era (see Flash Attention).
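
The difference can be sketched at the level of one attention tile. This is plain NumPy standing in for what a fused kernel would do per block; the function names and tile sizes are hypothetical, and no real kernel API is implied.

```python
import numpy as np

def alibi_tile(row_start, col_start, block, slope):
    # Bias for one (block x block) tile, computed from indices alone.
    rows = row_start + np.arange(block)[:, None]
    cols = col_start + np.arange(block)[None, :]
    return -slope * np.abs(rows - cols)

def bucket_tile(row_start, col_start, block, table, k):
    # Learned variant: the same index arithmetic, plus a gather from a table.
    rows = row_start + np.arange(block)[:, None]
    cols = col_start + np.arange(block)[None, :]
    return table[np.clip(rows - cols, -k, k) + k]

n, block, slope = 16, 4, 0.25
idx = np.arange(n)

# Full N x N bias, built here only to check the tile against it.
full = -slope * np.abs(idx[:, None] - idx[None, :])
tile = alibi_tile(8, 4, block, slope)

# Stand-in table of 2k+1 scalars with k = 4.
table = np.arange(9.0)
far_tile = bucket_tile(8, 0, block, table, k=4)
```

Neither version needs the full N×N matrix in memory, but the ALiBi tile is pure arithmetic on indices, while the learned tile also needs a table read for every entry.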