
Griffin

Gated linear recurrences mixed with local attention

01 · First principles · The inference bill nobody wants to pay

A transformer generating token N must hold keys and values for all N previous tokens. The KV cache grows linearly with context and, past a point, it — not the weights — is what fills the GPU and bounds batch size; per-token latency grows with context length too. An RNN pays none of this: its state is a fixed-size vector, so token one million costs exactly what token ten cost.
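To put rough numbers on that contrast, here is a minimal sketch of the memory arithmetic. The model shape (32 layers, 32 KV heads, head dimension 128, 2 bytes per value) and the 4096-wide recurrent state are hypothetical figures chosen for illustration, not the configuration of any specific model.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one key vector and one value vector per token, layer and head.
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per_value


def recurrent_state_bytes(n_layers, state_dim, bytes_per_value=2):
    """Bytes for a fixed-size recurrent state; there is no n_tokens argument."""
    return n_layers * state_dim * bytes_per_value


# Hypothetical model shape, for illustration only.
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128)

for n in (1_000, 100_000, 1_000_000):
    gib = kv_cache_bytes(n, **shape) / 2**30
    print(f"{n:>9} tokens: KV cache = {gib:8.2f} GiB")

kib = recurrent_state_bytes(n_layers=32, state_dim=4096) / 2**10
print(f"recurrent state = {kib:.0f} KiB at any context length")
```

The cache function is linear in `n_tokens`; the recurrent one does not take a token count at all, which is the whole point.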

So why did RNNs lose? Two failures. First, training is sequential: each step waits on the previous hidden state, so a sequence cannot be processed in parallel across time the way a transformer processes it under teacher forcing, and wall-clock training is slow. Second, a fixed-size state must compress the past, so exact retrieval of a name from 50 tokens ago, trivial for attention, is unreliable. The question Griffin answers: how much of each machine do you actually need, and for what?

02 · Failure modes · Each pure architecture breaks somewhere

Pure transformer
Exact retrieval anywhere in context, parallel training. But KV cache grows O(N) and per-token inference cost grows with context. Long-context serving is a memory problem before it is a quality problem.
Pure recurrence (incl. Mamba)
O(1) state, constant-cost inference. But everything must squeeze through a fixed-size bottleneck: precise copying, induction, and needle-retrieval over long ranges degrade — lossy compression by construction.

The key empirical observation behind hybrids: most of what attention is used for in language is local — resolving a pronoun, copying a nearby name, finishing a syntactic pattern. Global, exact, arbitrary-range retrieval is the rare case. Paying O(N) cache for every layer buys global retrieval everywhere, and mostly wastes it.

03 · Mechanism I · The RG-LRU: a gated linear recurrence

Griffin's recurrent block is the Real-Gated Linear Recurrent Unit (RG-LRU). Strip the tanh out of an RNN, restrict the recurrent weights to a diagonal, and the recurrence becomes linear and elementwise:

$$h_t = a_t \odot h_{t-1} + \sqrt{1 - a_t^2} \odot (i_t \odot x_t)$$

where $a_t$ is the learned, input-dependent decay (recurrence gate) and $\sqrt{1 - a_t^2}$ is the norm-preserving input scale.
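A minimal sketch of one update step may make the recurrence concrete. It takes the gates `a` and `i` as given lists of numbers; in the full model they are computed from the input, and that part is deliberately left out here. The function name `rg_lru_step` is made up for this example.

```python
import math


def rg_lru_step(h_prev, x, a, i):
    """One elementwise update: h = a * h_prev + sqrt(1 - a^2) * (i * x).

    All arguments are equal-length lists of floats; each channel is updated
    independently, so there is no matrix multiply inside the recurrence.
    """
    return [
        a_k * h_k + math.sqrt(1.0 - a_k * a_k) * (i_k * x_k)
        for h_k, x_k, a_k, i_k in zip(h_prev, x, a, i)
    ]


# Channel 0 decays slowly (a = 0.6) and mostly keeps its old state;
# channel 1 has a = 0.0 and is overwritten by the gated input.
h = rg_lru_step(h_prev=[1.0, 1.0], x=[1.0, 5.0], a=[0.6, 0.0], i=[1.0, 1.0])
print(h)

# The two coefficients satisfy a^2 + (sqrt(1 - a^2))^2 = 1, which is why the
# input scale is called norm-preserving: for independent unit-variance state
# and input, the variance of the new state stays at 1.
a = 0.6
print(a * a + math.sqrt(1.0 - a * a) ** 2)
```

An `a` close to 1 makes a channel remember; an `a` close to 0 makes it follow the current input.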

04 · Mechanism II · Local attention, and the interleave

The second ingredient is ordinary attention (multi-query in the Griffin paper) restricted to a sliding window of W tokens (Griffin used 1024). Its KV cache is capped at W entries regardless of context length, so it is O(1) in N, like the recurrence. Griffin interleaves the two block types (a repeating pattern of two recurrent blocks, then one local-attention block), and the division of labour is clean:
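The capped cache and the interleave pattern can be sketched in a few lines. This is an illustration under simple assumptions: keys and values are stand-in tuples rather than tensors, and `SlidingWindowKVCache` and `layer_pattern` are hypothetical names, not part of any library.

```python
from collections import deque


class SlidingWindowKVCache:
    """Keeps keys and values for only the most recent `window` tokens."""

    def __init__(self, window):
        # deque(maxlen=...) drops the oldest entry automatically once full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


def layer_pattern(n_blocks):
    """Repeating layout: two recurrent blocks, then one local-attention block."""
    unit = ("recurrent", "recurrent", "local_attention")
    return [unit[k % 3] for k in range(n_blocks)]


cache = SlidingWindowKVCache(window=1024)
for t in range(5000):
    cache.append(key=("k", t), value=("v", t))

print(len(cache))      # stays at the window size however many tokens arrive
print(cache.keys[0])   # the oldest token the attention layer can still see
print(layer_pattern(6))
```

After 5000 tokens the cache still holds 1024 entries; everything older is reachable only through the recurrent state.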

RG-LRU layers: long-range gist, compressed state   ·   local attention layers: exact recall within the window
[Figure: what each layer type sees when generating token t. RG-LRU: entire history, compressed into a fixed-size state h. Local attention: last W tokens, exact. Total inference state: O(1) in context length, a d-dimensional h per recurrent layer plus a W-entry KV cache per attention layer.]

The recurrence (blue in the figure) covers everything but lossily; local attention (terracotta) covers a bounded window but exactly. Stacked, they cover most of what language needs.
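The "O(1) in context length" claim can be checked with a small count of cached numbers. The sizes below (12 blocks, a 4096-dimensional recurrent state, 256 key/value dimensions per attention layer) are hypothetical defaults picked for the sketch; only the window of 1024 comes from the text.

```python
def hybrid_state_values(context_len, n_blocks=12, d_state=4096, window=1024, kv_dim=256):
    """Count the numbers held at inference for a 2-recurrent : 1-local-attention stack."""
    n_attn = n_blocks // 3              # every third block is local attention
    n_rec = n_blocks - n_attn           # the rest are recurrent
    cached_tokens = min(context_len, window)
    recurrent = n_rec * d_state                      # one fixed vector per recurrent block
    attention = n_attn * 2 * cached_tokens * kv_dim  # keys and values, capped at the window
    return recurrent + attention


for n in (10, 1_024, 100_000, 1_000_000):
    print(f"{n:>9} tokens -> {hybrid_state_values(n):>9} cached values")
```

The count grows until the context fills the window and is flat from then on.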

05 · Position · Against Mamba and the transformer

| | Transformer | Mamba (pure SSM) | Griffin (hybrid) |
|---|---|---|---|
| Training | parallel | parallel (scan) | parallel |
| Inference state | O(N) KV cache | O(1) | O(1) |
| Exact local recall | yes | unreliable | yes (within window) |
| Exact recall beyond window | yes | no | no (the honest cost) |
| Length extrapolation | poor (untreated) | good | good |
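Why can a linear recurrence train in parallel when a tanh RNN cannot? Each step h → a·h + b is an affine map, and composing affine maps is associative, so steps can be merged pairwise in a tree instead of strictly left to right. The sketch below shows this for a single scalar channel, with b standing in for the gated, scaled input. It computes only the final state; a full scan applies the same operator to obtain every prefix. It is an illustration of the idea, not a claim about how any particular implementation schedules the work.

```python
def combine(left, right):
    """Compose two steps of the form h -> a * h + b; `left` is applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential(steps, h0=0.0):
    """Reference: apply the steps one after another."""
    h = h0
    for a, b in steps:
        h = a * h + b
    return h


def tree_reduce(steps):
    """Pairwise reduction; the combines within one level do not depend on each other."""
    while len(steps) > 1:
        paired = [combine(steps[k], steps[k + 1]) for k in range(0, len(steps) - 1, 2)]
        if len(steps) % 2:
            paired.append(steps[-1])  # odd element carries over, order preserved
        steps = paired
    return steps[0]


steps = [(0.9, 1.0), (0.5, 2.0), (0.8, -1.0), (0.7, 0.5), (0.6, 3.0)]
a_total, b_total = tree_reduce(steps)
h0 = 2.0
assert abs((a_total * h0 + b_total) - sequential(steps, h0)) < 1e-12
```

With a tanh between steps the maps are no longer affine, `combine` has no closed form, and the tree collapses back into a chain.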

Griffin matched Llama-2-class transformer quality at the scales tested while training on over six times fewer tokens, and it extrapolates to sequences longer than those trained on (the recurrence does not care how long the line is). The honest cost is the last row above: a needle planted 50,000 tokens back, outside every attention window, must survive in the compressed recurrent state, and exactness there is not guaranteed.
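A back-of-envelope calculation shows why. Unrolling the recurrence, an input written k steps ago is multiplied by the product of the k decay values that followed it. The sketch below assumes a constant decay a, which is a simplification: the real gate is input-dependent and can hold a channel close to 1 when it has learned to.

```python
def surviving_weight(a, distance):
    """Factor multiplying an old input's contribution after `distance` further steps."""
    return a ** distance


for a in (0.99, 0.999, 0.9999):
    print(f"a = {a}: weight after 50,000 steps = {surviving_weight(a, 50_000):.3e}")
```

Unless the decay sits extremely close to 1 for the whole stretch, the needle's trace is negligible by the time it is needed, whereas an attention layer whose window still contained it would read it back unchanged.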

The hybrid bet aged well. Recurrent-plus-attention layouts reappear in Jamba (Mamba layers interleaved with attention layers), RecurrentGemma (Griffin productised), and several production long-context models. The architectures stopped competing and started sharing layers.
Mental Model