LLMs

Transformer vs RNN vs S4

Two axes — parallel training and O(1) inference — and who gets which

01 · First principlesThe two things a sequence model must do cheaply

Every sequence architecture answers two cost questions, and they pull in opposite directions:

Training: given the whole sequence at once, can every position be processed in parallel? (GPUs are wasted otherwise.)
Inference: generating token N+1, how much state must be carried, and how much work done? Ideally O(1) in sequence length.

Parallel training wants each output to be a direct function of all inputs (no chain through time). Cheap inference wants the past compressed into a fixed-size summary. The tension is that a fixed-size summary is a chain through time. Each architecture is one resolution of this tension, and the comparison table below is the whole note — the rest is explanation.

02 · The tableCosts side by side

	RNN / LSTM	Transformer	S4 / SSM (Mamba)
Training over N tokens	sequential — O(N) dependent steps	parallel — one big matmul, O(N²) work	parallel — convolution / scan, O(N log N) or O(N)
Per-token inference	O(1) compute, O(1) state	O(N) compute, O(N) KV cache	O(1) compute, O(1) state
Memory of the past	lossy fixed vector; forced forgetting	exact — every token retrievable	lossy but structured; long-range by design
Gradient path, token 1 → N	length N (vanishing gradients)	length 1 (direct attention edge)	length 1 through the convolution view
Precise recall over long context	weak	strong (its defining skill)	weaker — state is finite by design

03 · RNNThe fixed-size summary, and its two failures

The RNN is the honest baseline: h_t = f(h_t−1, x_t). Inference is perfect — one vector of state, constant work per token. But training cannot be parallelised, because h_t does not exist until h_t−1 does; a million-token sequence is a million dependent steps regardless of how many GPUs you own. And learning long-range dependencies means pushing gradients through a million applications of f, where they vanish or explode (gating, as in LSTMs, softens this without solving it). The fixed-size h must also overwrite something whenever it learns something — forgetting is structural, not incidental.

04 · TransformerPay O(N) at inference to train in parallel

The transformer refuses to summarise. Every token attends directly to every earlier token, so training is one parallel pass (see causal attention for why the mask makes this legal), the gradient between any two positions travels one edge, and retrieval is exact — the information is all still there. The bill arrives at inference: nothing was compressed, so generating each new token means consulting all previous keys and values. The KV cache grows O(N) in memory and each step costs O(N) compute. Long-context serving cost is this line item, and engineering around it (Flash Attention, GQA, paged caches) is a small industry.

05 · S4 / SSMBoth, via a duality

A state-space model is a linear recurrence: h_t = A h_t−1 + B x_t, y_t = C h_t. Linearity is the key that unlocks both doors at once. Unroll the recurrence and y is a convolution of the input with a kernel built from powers of A:

y_t = Σ_j≤t C A^t−j B x_j ⇒ y = K̄ ∗ x

kernel entry for lag t−j one convolution: fully parallel

So the same model runs in two modes: at training time, compute the kernel and convolve (parallel, FFT-fast); at inference time, step the recurrence (O(1) per token, like an RNN). S4's contribution was making A well-conditioned for long range (the HiPPO parameterisation, so powers of A neither vanish nor explode); Mamba's was making A, B, C depend on the input — selective forgetting — while keeping a parallel scan for training.

The convolution/recurrence duality. Nonlinear RNNs cannot be unrolled into a kernel; linearity buys the second execution plan.

The tradeoff did not vanish — it moved. The state is still finite, so SSMs compress, and they measurably trail transformers on tasks that need exact retrieval of arbitrary distant tokens (long in-context lookups, verbatim copying). Hence the current hybrids, which interleave a few attention layers into an SSM stack: mostly-O(1) inference, with occasional exact-recall layers as insurance.

06 · Reading the tableNo free lunch, only relocation

Read the table by columns and a conservation law appears: nobody gets parallel training, O(1) inference, and exact memory. RNNs give up training speed, transformers give up inference cost, SSMs give up exact recall. Which sacrifice is right depends on the workload — which is why transformers own quality-critical chat and retrieval-heavy tasks, while SSM blood keeps entering models whose economics are dominated by long-sequence serving.

Mental Model

Judge any sequence model on two axes: parallel training, O(1) inference. Exact memory is the hidden third axis paying for both.
RNN: compresses the past into a fixed vector — cheap inference, sequential training, structural forgetting.
Transformer: compresses nothing — parallel training and exact recall, but O(N) KV cache and per-token compute at inference.
S4/SSM: a linear recurrence is also a convolution; train as the latter, infer as the former, and get both axes.
The price SSMs pay is finite state: weak exact retrieval. Hybrids exist because that weakness is real.