Two axes — parallel training and O(1) inference — and who gets which
Every sequence architecture answers two cost questions, and they pull in opposite directions:
Parallel training wants each output to be a direct function of all inputs (no chain through time). Cheap inference wants the past compressed into a fixed-size summary. The tension is that a fixed-size summary is a chain through time. Each architecture is one resolution of this tension, and the comparison table below is the whole note — the rest is explanation.
| RNN / LSTM | Transformer | S4 / SSM (Mamba) | |
|---|---|---|---|
| Training over N tokens | sequential — O(N) dependent steps | parallel — one big matmul, O(N²) work | parallel — convolution / scan, O(N log N) or O(N) |
| Per-token inference | O(1) compute, O(1) state | O(N) compute, O(N) KV cache | O(1) compute, O(1) state |
| Memory of the past | lossy fixed vector; forced forgetting | exact — every token retrievable | lossy but structured; long-range by design |
| Gradient path, token 1 → N | length N (vanishing gradients) | length 1 (direct attention edge) | length 1 through the convolution view |
| Precise recall over long context | weak | strong (its defining skill) | weaker — state is finite by design |
The RNN is the honest baseline: ht = f(ht−1, xt). Inference is perfect — one vector of state, constant work per token. But training cannot be parallelised, because ht does not exist until ht−1 does; a million-token sequence is a million dependent steps regardless of how many GPUs you own. And learning long-range dependencies means pushing gradients through a million applications of f, where they vanish or explode (gating, as in LSTMs, softens this without solving it). The fixed-size h must also overwrite something whenever it learns something — forgetting is structural, not incidental.
The transformer refuses to summarise. Every token attends directly to every earlier token, so training is one parallel pass (see causal attention for why the mask makes this legal), the gradient between any two positions travels one edge, and retrieval is exact — the information is all still there. The bill arrives at inference: nothing was compressed, so generating each new token means consulting all previous keys and values. The KV cache grows O(N) in memory and each step costs O(N) compute. Long-context serving cost is this line item, and engineering around it (Flash Attention, GQA, paged caches) is a small industry.
A state-space model is a linear recurrence: ht = A ht−1 + B xt, yt = C ht. Linearity is the key that unlocks both doors at once. Unroll the recurrence and y is a convolution of the input with a kernel built from powers of A:
So the same model runs in two modes: at training time, compute the kernel and convolve (parallel, FFT-fast); at inference time, step the recurrence (O(1) per token, like an RNN). S4's contribution was making A well-conditioned for long range (the HiPPO parameterisation, so powers of A neither vanish nor explode); Mamba's was making A, B, C depend on the input — selective forgetting — while keeping a parallel scan for training.
The convolution/recurrence duality. Nonlinear RNNs cannot be unrolled into a kernel; linearity buys the second execution plan.
The tradeoff did not vanish — it moved. The state is still finite, so SSMs compress, and they measurably trail transformers on tasks that need exact retrieval of arbitrary distant tokens (long in-context lookups, verbatim copying). Hence the current hybrids, which interleave a few attention layers into an SSM stack: mostly-O(1) inference, with occasional exact-recall layers as insurance.
Read the table by columns and a conservation law appears: nobody gets parallel training, O(1) inference, and exact memory. RNNs give up training speed, transformers give up inference cost, SSMs give up exact recall. Which sacrifice is right depends on the workload — which is why transformers own quality-critical chat and retrieval-heavy tasks, while SSM blood keeps entering models whose economics are dominated by long-sequence serving.