Gated linear recurrences mixed with local attention
01 · First principles · The inference bill nobody wants to pay
A transformer generating token N must hold keys and values for all N previous tokens. The KV cache grows linearly with context and, past a point, it — not the weights — is what fills the GPU and bounds batch size; per-token latency grows with context length too. An RNN pays none of this: its state is a fixed-size vector, so token one million costs exactly what token ten cost.
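To make the growth concrete, here is a minimal sketch of the arithmetic. The model shape (32 layers, 32 KV heads, head dimension 128, 16-bit values) and both function names are hypothetical, chosen only for illustration. The recurrent state is modelled as one vector of the model width per layer, which is a simplification.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one key vector and one value vector per token, head and layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


def recurrent_state_bytes(n_layers, state_width, bytes_per_value=2):
    """Bytes for a fixed-size recurrent state; note there is no n_tokens argument."""
    return n_layers * state_width * bytes_per_value


# Hypothetical model shape, used only to make the arithmetic concrete.
LAYERS, KV_HEADS, HEAD_DIM, WIDTH = 32, 32, 128, 4096

for n in (1_000, 100_000, 1_000_000):
    kv = kv_cache_bytes(n, LAYERS, KV_HEADS, HEAD_DIM)
    rnn = recurrent_state_bytes(LAYERS, WIDTH)
    print(f"{n:>9} tokens: KV cache {kv / 2**30:8.2f} GiB, "
          f"recurrent state {rnn / 2**10:.0f} KiB")
```

Under these assumed sizes the cache costs half a mebibyte per token, so it scales directly with context length, while the recurrent figure does not depend on the token count at all.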
So why did RNNs lose? Two failures. First, training is sequential: each hidden state depends nonlinearly on the previous one, so the time steps cannot be computed in parallel even under teacher forcing, and wall-clock training is slow. Second, a fixed-size state must compress the past, so exact retrieval of a name from 50 tokens ago, trivial for attention, is unreliable. The question Griffin answers: how much of each machine do you actually need, and for what?
02 · Failure modes · Each pure architecture breaks somewhere
Pure transformer
Exact retrieval anywhere in context, parallel training. But KV cache grows O(N) and per-token inference cost grows with context. Long-context serving is a memory problem before it is a quality problem.
Pure recurrence (incl. Mamba)
O(1) state, constant-cost inference. But everything must squeeze through a fixed-size bottleneck: precise copying, induction, and needle-retrieval over long ranges degrade — lossy compression by construction.
The key empirical observation behind hybrids: most of what attention is used for in language is local — resolving a pronoun, copying a nearby name, finishing a syntactic pattern. Global, exact, arbitrary-range retrieval is the rare case. Paying O(N) cache for every layer buys global retrieval everywhere, and mostly wastes it.
03 · Mechanism I · The RG-LRU: a gated linear recurrence
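Griffin's recurrent block is the Real-Gated Linear Recurrent Unit (RG-LRU). Strip the tanh out of an RNN and the recurrence becomes linear and elementwise:
$$h_t = a_t \odot h_{t-1} + \sqrt{1 - a_t^2} \odot (i_t \odot x_t)$$
Here x_t is the current input, h_t the recurrent state, a_t the decay gate and i_t an input gate, both computed from x_t; ⊙ is the elementwise product.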
Linear + elementwise: no tanh through time means no vanishing-gradient squashing, stable training, and an efficient (parallelisable) scan implementation.
Input-dependent gate a_t: the decay rate is computed from the current token, so the unit can choose, per channel, to hold information for thousands of steps (a ≈ 1) or flush it instantly (a ≈ 0). This data-dependent gating is the same move that separates Mamba from earlier S4-style models, whose dynamics were fixed; see Transformer vs RNN vs S4.
The √(1 − a²) factor keeps the state's scale constant whether the gate holds or flushes — a small term doing quiet stability work.
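The three properties above can be sketched in plain Python. This is an illustrative toy, not Griffin's implementation: the gate parametrisation (a_t = sigmoid(Λ)^(c·r_t) with c = 8) follows the form given in the Griffin paper, the per-channel scalar gate weights are a simplifying assumption, and every function and parameter name is hypothetical. The `combine` helper shows why a linear update admits a parallel scan: composing two updates of the form h → a·h + b gives another update of the same form, and that composition is associative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rg_lru_step(h_prev, x, lam, w_a, b_a, w_x, b_x, c=8.0):
    """One step of a gated linear recurrence, applied independently per channel.

    Every argument except c is a list with one entry per channel.
    """
    h_new = []
    for h, x_t, lam_k, wa, ba, wx, bx in zip(h_prev, x, lam, w_a, b_a, w_x, b_x):
        r = sigmoid(wa * x_t + ba)      # recurrence gate, computed from the input
        i = sigmoid(wx * x_t + bx)      # input gate, computed from the input
        a = sigmoid(lam_k) ** (c * r)   # assumed form of the decay: near 1 holds, near 0 flushes
        # Linear update (no tanh around h); sqrt(1 - a^2) rescales the new input.
        h_new.append(a * h + math.sqrt(1.0 - a * a) * (i * x_t))
    return h_new


def rg_lru(xs, lam, w_a, b_a, w_x, b_x):
    """Run the recurrence over a sequence; the state size never depends on its length."""
    h = [0.0] * len(lam)
    out = []
    for x in xs:
        h = rg_lru_step(h, x, lam, w_a, b_a, w_x, b_x)
        out.append(h)
    return out


def combine(first, second):
    """Compose two updates h -> a*h + b, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a1 * a2, a2 * b1 + b2)


# Toy run: channel 0 is set to hold (a near 1), channel 1 to flush (a near 0).
xs = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
zeros = [0.0, 0.0]
out = rg_lru(xs, lam=[10.0, -10.0], w_a=zeros, b_a=zeros, w_x=zeros, b_x=zeros)
for t, h in enumerate(out):
    print(t, [round(v, 6) for v in h])
```

In the toy run the first channel keeps almost all of its value once the input goes to zero, while the second drops to essentially zero after one step. A production implementation would replace the Python loop with a fused or parallel scan built on the same `combine` rule.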
04 · Mechanism II · Local attention, and the interleave
The second ingredient is ordinary multi-head attention restricted to a sliding window of W tokens (Griffin used 1024). Its KV cache is capped at W entries regardless of context length — O(1) in N, like the recurrence. Griffin interleaves the two block types (a repeating pattern of two recurrent blocks, then one local-attention block), and the division of labour is clean:
RG-LRU layers: long-range gist, compressed state · local attention layers: exact recall within the window
The RG-LRU layers cover everything, but lossily; the local-attention layers cover only a bounded window, but exactly. Stacked, they cover most of what language needs.
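A minimal sketch of the two ideas in this section: a key/value cache that never holds more than W entries, and the repeating two-recurrent, one-local-attention layout. The class and function names are hypothetical, and integers stand in for real key and value tensors.

```python
from collections import deque


class SlidingWindowKVCache:
    """Keeps keys and values for only the most recent `window` tokens."""

    def __init__(self, window):
        # deque(maxlen=...) drops the oldest entry automatically once full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


def layer_pattern(n_blocks):
    """Repeating layout: two recurrent blocks, then one local-attention block."""
    unit = ("recurrent", "recurrent", "local_attention")
    return [unit[i % len(unit)] for i in range(n_blocks)]


cache = SlidingWindowKVCache(window=1024)
for t in range(5000):
    cache.append(key=t, value=t)  # integers stand in for real tensors

print(len(cache))        # cached positions after 5000 tokens
print(cache.keys[0])     # oldest position still visible to attention
print(layer_pattern(6))
```

After 5000 tokens the cache still holds only the last 1024 positions, which is the sense in which the attention layers' memory is constant in N.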
05 · Position · Against Mamba and the transformer
| | Transformer | Mamba (pure SSM) | Griffin (hybrid) |
|---|---|---|---|
| Training | parallel | parallel (scan) | parallel |
| Inference state | O(N) KV cache | O(1) | O(1) |
| Exact local recall | yes | unreliable | yes (within window) |
| Exact recall beyond window | yes | no | no (the honest cost) |
| Length extrapolation | poor untreated | good | good |
Griffin matched Llama-2-class transformer quality at the scales tested while training on several times fewer tokens, and it extrapolates to sequences longer than those trained on (the recurrence does not care how long the line is). The honest cost is the last row above: a needle planted 50,000 tokens back, outside every attention window, must survive in the compressed recurrent state, and exactness there is not guaranteed.
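A rough way to see why a distant needle is at risk: under the simplifying assumption of a constant per-step decay a (the real gate is input-dependent, so this is only a back-of-envelope model), the weight left on an old state entry after k steps is a^k. The function names below are hypothetical.

```python
import math


def surviving_fraction(a, steps):
    """Weight a**steps left on an old state entry after `steps` updates at constant decay a."""
    return a ** steps


def half_life(a):
    """Number of steps after which that weight has halved."""
    return math.log(0.5) / math.log(a)


for a in (0.999, 0.9999, 0.99999):
    print(f"a={a}: half-life {half_life(a):9.0f} steps, "
          f"weight after 50,000 steps {surviving_fraction(a, 50_000):.3g}")
```

In this model a channel has to keep its decay very close to 1 for the entire span to carry anything across 50,000 tokens, and even then the surviving magnitude says nothing about whether the content can be read back exactly.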
The hybrid bet aged well. Layouts that interleave recurrent or state-space layers with attention reappear in Jamba (Mamba layers plus attention), RecurrentGemma (Griffin productised), and several production long-context models. The architectures stopped competing and started sharing layers.
Mental Model
The transformer's inference problem is the KV cache; the RNN's quality problem is lossy compression. Each is the other's cure.
RG-LRU = linear elementwise recurrence with an input-dependent decay gate: trainable in parallel, holds or flushes per token.
Local sliding-window attention supplies the one thing recurrence lacks — exact recall — where it is most needed: nearby.
Total inference state is O(1) in context length; that is the entire economic argument.
What you give up: guaranteed exact retrieval beyond the window.