Recurrence between segments, so context survives the window edge
Attention costs O(N²), so a vanilla transformer trains on text chopped into fixed segments of, say, 512 tokens. Each segment is processed in isolation. The question that forces Transformer-XL to exist: what happens to information at the boundary between segments?
It is destroyed. Token 513 begins a new segment with no memory that tokens 1–512 ever existed, even if token 512 was the subject of its sentence. The literature calls this context fragmentation: the model's effective memory is not "the last 512 tokens" but "the tokens since the last arbitrary chop point," which near a boundary is almost nothing.
Doubling the segment length pays O(N²) for every doubling and merely moves the cliff; the boundary failure is structural, not a matter of size. What we actually want is what RNNs had all along — state that carries over — without giving up parallel training within a segment. So the fix is to graft recurrence onto the transformer, but at the level of segments rather than tokens.
While processing segment τ, every layer is allowed to attend not only to this segment's hidden states but also to the cached hidden states of the previous segment, held fixed:
SG is stop-gradient: the cache is read-only. Queries come only from the current segment (we do not recompute the past); keys and values span both. The stop-gradient is what keeps training affordable — backprop never unrolls into previous segments, so the cost per segment is unchanged, yet information still flows forward through the cache.
Each layer reaches one segment further back than the layer below, so an L-layer model sees O(L × segment) tokens — context compounds with depth.
At inference the same cache makes generation fast: slide forward by reusing cached states instead of re-encoding the window, which made evaluation up to ~1800× faster than the recompute-everything baseline.
Recurrence breaks the original positional scheme, and it is worth seeing precisely how. With absolute encodings, every segment stamps its tokens with positions 0…511. Now the cached segment and the current segment both carry the same stamps:
The model cannot tell "the 17th token of the previous segment" from "the 17th token of this one" (a catastrophe for any order-sensitive pattern). The information that actually matters to attention was never "where am I on an absolute ruler" but "how far apart are query and key."
So Transformer-XL rewrites the attention score to depend only on the offset i − j. Expanding the standard score with absolute embeddings gives four terms (content–content, content–position, position–content, position–position); the fix replaces every absolute position vector with a sinusoidal encoding of the relative distance Ri−j, plus two learned global bias vectors u, v that stand in for "the query's own position," which no longer needs to exist:
A distance of 600 tokens now means the same thing whether or not a segment boundary sits in between. Recurrence and relative positions are not two separate features; the second is what makes the first coherent. The broader family of relative schemes — Shaw, T5 buckets, ALiBi — is covered in Relative Positional Embeddings, and the modern descendant of this idea is RoPE.