One linear system, two faces: recurrence for inference, convolution for training
Sequence modelling after transformers had an unresolved tension. RNNs infer cheaply — O(1) state per step — but train sequentially and forget over long horizons. Transformers train in parallel and recall anything, but pay O(T²) attention and carry a cache that grows with the sequence. The question: is there a model that trains like a transformer and infers like an RNN?
State space models answer by going back to a much older object — the linear dynamical system of control theory — and noticing it has exactly the two faces required.
A hidden state x evolves continuously under A, is written to by the input u through B, and is read out through C. Discretise with a step size Δ (zero-order hold gives A̅ = exp(ΔA), with a matching B̅) and you get a sequence-to-sequence map ready for tokens.
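For reference, the system described above can be written out as follows. This is a sketch under two assumptions that the text does not state: there is no direct feedthrough term, and A is invertible, so that the zero-order-hold formula for B̅ has a closed form.

$$
\dot{x}(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t)
$$

$$
\bar{A} = \exp(\Delta A), \qquad \bar{B} = A^{-1}\bigl(\exp(\Delta A) - I\bigr)\,B
$$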
Because the system is linear and time-invariant, the discretised model can be computed two ways, and this duality is the defining idea of the whole field.
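Written out, the two computations look like this. The indexing convention (a zero initial state, with the kernel indexed from j = 0) is an assumption chosen for illustration.

$$
\text{Recurrent: } \quad x_k = \bar{A}\,x_{k-1} + \bar{B}\,u_k, \qquad y_k = C\,x_k, \qquad x_{-1} = 0
$$

$$
\text{Convolutional: } \quad y_k = \sum_{j=0}^{k} \bar{K}_j\,u_{k-j}, \qquad \bar{K}_j = C\,\bar{A}^{\,j}\,\bar{B}
$$

Unrolling the recurrence and collecting the coefficient of each past input gives the second form, so the two agree term by term.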
Same weights, same outputs, two execution plans. Linearity is the price of admission: a tanh anywhere in the state update (as in an RNN) destroys the unrolling, and with it the convolutional face. S4's bet is that the nonlinearity can live between layers (mixing, gating) while the time axis itself stays linear.
Train through the convolutional face, deploy through the recurrent face; the parameters never know the difference.
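The equivalence can be checked numerically. The following is a minimal sketch, not S4 itself: it assumes a diagonal, real, stable A (so the matrix exponential reduces to an elementwise one), a scalar input channel, and a zero initial state. All function names are hypothetical.

```python
import numpy as np

def discretize_zoh_diag(a, b, delta):
    """Zero-order hold for a diagonal A, given as the 1-D array `a`."""
    a_bar = np.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b  # A^{-1} (exp(dA) - I) B, elementwise
    return a_bar, b_bar

def run_recurrent(a_bar, b_bar, c, u):
    """Recurrent face: one fixed-size state update per step."""
    x = np.zeros_like(a_bar)
    ys = []
    for u_k in u:
        x = a_bar * x + b_bar * u_k
        ys.append(np.dot(c, x))
    return np.array(ys)

def run_convolutional(a_bar, b_bar, c, u):
    """Convolutional face: build the kernel once, then FFT-convolve."""
    T = len(u)
    powers = a_bar[None, :] ** np.arange(T)[:, None]  # shape (T, N)
    kernel = powers @ (c * b_bar)                     # K_j = C A_bar^j B_bar
    n = 2 * T                                         # zero-pad: no wraparound
    y = np.fft.irfft(np.fft.rfft(kernel, n) * np.fft.rfft(u, n), n)
    return y[:T]

rng = np.random.default_rng(0)
N, T = 4, 64
a = -np.linspace(0.5, 2.0, N)  # negative entries keep the system stable
b = rng.standard_normal(N)
c = rng.standard_normal(N)
u = rng.standard_normal(T)

a_bar, b_bar = discretize_zoh_diag(a, b, delta=0.1)
y_rec = run_recurrent(a_bar, b_bar, c, u)
y_conv = run_convolutional(a_bar, b_bar, c, u)
print(np.allclose(y_rec, y_conv))
```

The recurrent path touches one step at a time with a fixed-size state; the convolutional path sees the whole sequence at once, which is what makes it parallelisable during training.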
A random A fails for the same reason RNNs fail: the powers A̅^j either decay to nothing or blow up, so the convolution kernel K̅ (whose j-th entry is C A̅^j B̅) is effectively short and the long-range memory is fictional. S4's contribution was a principled initialisation. The HiPPO matrix is derived by asking: what dynamics make the state x(t) the optimal compression of the input's history? Concretely, the state holds the history's coefficients in a Legendre-polynomial basis, continuously updated as new input arrives.
With that A, the state is an online summary of the whole past by construction, and the kernel carries usable mass across thousands of steps (S4 was the first model to crack Path-X, at sequence length 16,384). The rest of S4 is numerical engineering — a structured (diagonal-plus-low-rank, later just diagonal in S4D/S5) parameterisation so that exp(ΔA) and the kernel are cheap and stable to compute.
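As an illustration, here is a sketch of the HiPPO matrix in its scaled-Legendre (LegS) form, as it is commonly written in S4 implementations. The text above does not spell out the matrix, so treat the formula in the docstring as an assumption, and `make_hippo_legs` as a hypothetical name.

```python
import numpy as np

def make_hippo_legs(n):
    """HiPPO-LegS matrix (assumed form):
    A[i, k] = -sqrt(2i+1) * sqrt(2k+1)  if i > k
            = -(i + 1)                  if i == k
            = 0                         if i < k
    """
    p = np.sqrt(1 + 2 * np.arange(n))
    a = np.tril(p[:, None] * p[None, :]) - np.diag(np.arange(n))
    return -a

A = make_hippo_legs(4)
print(np.round(A, 3))
```

The matrix is lower triangular, so its eigenvalues are the diagonal entries -1, -2, ..., -n. All are negative, which is consistent with the stable, non-exploding dynamics the paragraph above describes.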
Vanilla S4 excelled on audio and long continuous signals but lagged transformers on language. The diagnosis is structural: A̅, B̅, C are fixed, so every token is written into the state with the same dynamics, regardless of what the token is. The model is a time-invariant filter; it cannot do content-based addressing — "ignore this filler word, latch onto that name, recall it when the question arrives" — which is precisely what attention does natively and language constantly demands.
Selective SSMs (Mamba) repair this by making Δ, B, C functions of the current input. The dynamics now depend on content: the model can dilate time (large Δ writes firmly, small Δ lets the token glide past) and modulate what is written and read, token by token. The cost is exact: time-invariance is gone, so the convolutional face is gone. Mamba recovers training parallelism a different way — an associative parallel scan over the recurrence, fused into GPU-friendly kernels — keeping O(T) training and O(1) inference.
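The scan idea can be sketched in a few lines. This is not Mamba's fused kernel. It assumes a diagonal state update h_t = a_t * h_{t-1} + b_t with h_{-1} = 0, and the input-dependent step size (a softplus of a scaled input, using the hypothetical weight `w_delta`) is a stand-in for the learned projections. The point is only that the update pairs (a_t, b_t) compose associatively, so prefixes can be combined in about log2(T) parallel rounds.

```python
import numpy as np

def scan_sequential(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t, computed one step at a time."""
    h = np.zeros_like(b[0])
    out = []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return np.stack(out)

def scan_parallel(a, b):
    """Hillis-Steele style inclusive scan over the associative operator
    (a1, b1) then (a2, b2)  ->  (a1 * a2, a2 * b1 + b2).
    Each round is a whole-array operation with no loop over time."""
    a, b = a.copy(), b.copy()
    T = a.shape[0]
    shift = 1
    while shift < T:
        a_new, b_new = a.copy(), b.copy()
        # Combine the segment ending at t - shift with the one ending at t.
        b_new[shift:] = a[shift:] * b[:-shift] + b[shift:]
        a_new[shift:] = a[shift:] * a[:-shift]
        a, b = a_new, b_new
        shift *= 2
    return b

rng = np.random.default_rng(0)
T, N = 37, 3
u = rng.standard_normal(T)
lam = np.linspace(0.5, 2.0, N)             # decay rates (hypothetical values)
w_delta = 1.5                              # hypothetical projection weight
delta = np.log1p(np.exp(w_delta * u))      # softplus: input-dependent step size
a = np.exp(-delta[:, None] * lam[None, :]) # per-token decay, shape (T, N)
b = (delta * u)[:, None] * np.ones(N)      # per-token write, shape (T, N)

h_seq = scan_sequential(a, b)
h_par = scan_parallel(a, b)
print(np.allclose(h_seq, h_par))
```

This simple variant does O(T log T) total work across its rounds; work-efficient scans of the Blelloch type bring the total back to O(T) while keeping logarithmic depth.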
| | RNN / LSTM | Transformer | SSM (S4 / Mamba) |
|---|---|---|---|
| Training over length T | sequential | parallel, O(T²) | parallel, O(T log T) / scan |
| Inference state | O(1) | KV cache grows with T | O(1) |
| Long-range memory | gated, lossy | exact lookup | strong but compressed |
| Content-based recall | weak | native | only with selectivity (Mamba) |
The compressed state is both the selling point and the caveat: a fixed-size state cannot store an arbitrarily long context verbatim, so needle-in-haystack recall remains attention's home turf, and production systems increasingly hybridise.