Memory as a loop, and the gradient that would not survive it
A sequence model must let the past influence the present: the meaning of a word depends on the sentence so far, the next sensor reading on the trajectory so far. Feeding a fixed window into a dense net fails twice: the window length caps the memory, and, just as with a dense net applied to images (the problem CNNs solve), a pattern learned at position 5 teaches nothing about position 50.
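The recurrent answer: keep a hidden state $h_t$, a running summary of everything seen so far, and update it with the same function at every step. In the standard form, with tanh as the usual nonlinearity:
$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$$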
Weight sharing across time is the move — the convolution idea rotated ninety degrees. One transition function handles sequences of any length, and "what to remember" is learned rather than designed.
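The shared-transition idea can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes a tanh transition, and the names `rnn_step`, `rnn_forward`, `matvec` and `random_matrix`, the sizes, and the random weights are all hypothetical.

```python
import math
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(h_prev, x, W_h, W_x, b):
    """One transition: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    rec = matvec(W_h, h_prev)
    inp = matvec(W_x, x)
    return [math.tanh(r + i + b_k) for r, i, b_k in zip(rec, inp, b)]

def rnn_forward(xs, W_h, W_x, b, hidden_size):
    """Apply the same step (same weights) at every position of the sequence."""
    h = [0.0] * hidden_size
    states = []
    for x in xs:
        h = rnn_step(h, x, W_h, W_x, b)
        states.append(h)
    return states

def random_matrix(rows, cols, rng, scale=0.5):
    """Illustrative random weights; a trained model would learn these."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
hidden_size, input_size = 3, 2
W_h = random_matrix(hidden_size, hidden_size, rng)
W_x = random_matrix(hidden_size, input_size, rng)
b = [0.0] * hidden_size

# The same three parameter arrays handle a 5-step and a 50-step sequence.
short_seq = [[rng.uniform(-1, 1) for _ in range(input_size)] for _ in range(5)]
long_seq = [[rng.uniform(-1, 1) for _ in range(input_size)] for _ in range(50)]

short_states = rnn_forward(short_seq, W_h, W_x, b, hidden_size)
long_states = rnn_forward(long_seq, W_h, W_x, b, hidden_size)
print(len(short_states), len(long_states))  # 5 50
```

Nothing in the parameters depends on the sequence length, which is the whole point: the loop, not the weight shapes, absorbs the extra time steps.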
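Training unrolls the loop into a T-layer feedforward net (one layer per time step, all sharing weights) and backpropagates through it: backpropagation through time. Here the shared weights turn poisonous. The gradient of a loss at step T with respect to the state at step t passes through every intermediate transition (written here for a tanh transition):
$$\frac{\partial \mathcal{L}_T}{\partial h_t} = \frac{\partial \mathcal{L}_T}{\partial h_T} \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}, \qquad \frac{\partial h_k}{\partial h_{k-1}} = \operatorname{diag}(1 - h_k^2)\, W_h$$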
A deep feedforward net multiplies different Jacobians, whose effects can partially cancel. An RNN multiplies essentially the same one, so the product is governed by the spectral radius of $W_h$ (further damped by the activation derivative, which for tanh is at most 1): below 1 the gradient shrinks geometrically (vanishing: in practice the network struggles to learn dependencies more than ~10–20 steps long, because the teaching signal from the future arrives microscopically small); above 1 it can blow up (exploding: fixable by gradient clipping, which is why exploding is the lesser disease). Full treatment in exploding/vanishing gradients.
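The geometric shrink and blow-up can be checked numerically. The sketch below makes two simplifying assumptions: the transition is treated as linear (the activation derivative is dropped, which for tanh could only shrink the gradient further), and the weight matrix is a rotation scaled by `rho`, chosen because its spectral radius is exactly `rho` and each backward step rescales the gradient norm by exactly that factor. It also includes a norm-based clipping helper. All names (`scaled_rotation`, `backprop_norms`, `clip_by_norm`) are hypothetical.

```python
import math

def scaled_rotation(rho, theta=0.7):
    """2x2 matrix with spectral radius exactly rho (a rotation scaled by rho)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[rho * c, -rho * s], [rho * s, rho * c]]

def backprop_norms(W, steps):
    """Norm of a unit gradient after each of `steps` multiplications by W^T."""
    g = [1.0, 0.0]
    norms = []
    for _ in range(steps):
        # g <- W^T g: one backward step through a linearised transition
        g = [W[0][0] * g[0] + W[1][0] * g[1],
             W[0][1] * g[0] + W[1][1] * g[1]]
        norms.append(math.hypot(g[0], g[1]))
    return norms

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

shrinking = backprop_norms(scaled_rotation(0.9), 50)
growing = backprop_norms(scaled_rotation(1.1), 50)
print(f"rho=0.9 after 50 steps: {shrinking[-1]:.2e}")
print(f"rho=1.1 after 50 steps: {growing[-1]:.2e}")

# Clipping tames an exploded gradient but cannot restore a vanished one.
clipped = clip_by_norm([300.0, -400.0], max_norm=5.0)
print(clipped)
```

Clipping only bounds the step size; it does nothing for the vanishing case, where the information has already been lost by the time the gradient arrives.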
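If multiplication along the time path is the disease, the cure is to give memory a path that is additive. The LSTM adds a second state, the cell state $c_t$ (a conveyor belt running through time), updated by addition, with three learned sigmoid gates (each a function of $h_{t-1}$ and $x_t$, outputting values in (0, 1)) deciding what gets on and off the belt. In the standard formulation:
$$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f), \quad i_t = \sigma(W_i[h_{t-1}, x_t] + b_i), \quad o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c[h_{t-1}, x_t] + b_c), \qquad h_t = o_t \odot \tanh(c_t)$$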
Now, along the direct cell-state path, $\partial c_t / \partial c_{t-1} \approx f_t$, a learned value the network can hold near 1 for as long as a memory matters. Gradients flow back along the cell state through an addition and a gentle elementwise gate, not through a repeated matrix multiply. This is the same additive-highway insight as ResNet skip connections, discovered for time before it was discovered for depth. The GRU is the lighter cousin: it merges cell and hidden state and uses two gates instead of three, so it has fewer parameters, typically trains somewhat faster, and often matches LSTMs in practice.
Figure: the LSTM cell. Memory rides the belt (the cell state, drawn in green) and is edited by × f and + i. Gradients flowing right-to-left along the belt meet addition, not repeated matrix multiplication.
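The belt update can be sketched for a single LSTM unit in plain Python. This is an illustration under stated assumptions, not a trained model: the state is scalar, the parameter values are hand-picked (including a large forget bias), and `lstm_step` and `params` are hypothetical names. The finite-difference check holds $h_{t-1}$ and $x_t$ fixed, which is the "direct path" along the cell state.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(h_prev, c_prev, x, p):
    """One step of a single-unit LSTM. `p` maps gate name -> (w_h, w_x, bias)."""
    def pre(name):
        w_h, w_x, bias = p[name]
        return w_h * h_prev + w_x * x + bias
    f = sigmoid(pre("f"))            # forget gate: how much old memory stays
    i = sigmoid(pre("i"))            # input gate: how much new content gets on
    o = sigmoid(pre("o"))            # output gate: how much memory is exposed
    c_tilde = math.tanh(pre("c"))    # candidate content
    c = f * c_prev + i * c_tilde     # additive belt update
    h = o * math.tanh(c)
    return h, c, f

# Hand-picked illustrative parameters, not learned values.
params = {
    "f": (0.1, 0.1, 6.0),   # large forget bias keeps f_t close to 1
    "i": (0.5, 0.5, 0.0),
    "o": (0.5, 0.5, 0.0),
    "c": (0.5, 0.5, 0.0),
}

# Finite-difference check of dc_t/dc_{t-1} with h_{t-1} and x_t held fixed.
h0, c0, x0, eps = 0.2, 0.7, -0.3, 1e-6
_, c_base, f_gate = lstm_step(h0, c0, x0, params)
_, c_bumped, _ = lstm_step(h0, c0 + eps, x0, params)
numeric = (c_bumped - c_base) / eps
print(abs(numeric - f_gate) < 1e-6)  # True

# Product of forget gates over 100 steps: the factor a gradient picks up
# travelling back along the belt (direct path only).
h, c, belt_factor = 0.0, 0.0, 1.0
for t in range(100):
    x = math.sin(0.1 * t)
    h, c, f = lstm_step(h, c, x, params)
    belt_factor *= f
print(belt_factor, 0.9 ** 100)
```

With the forget gate held near 1, the belt factor after 100 steps stays of order one, whereas a plain recurrence contracting by 0.9 per step would have shrunk the same signal by several orders of magnitude.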
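LSTMs were the workhorse of sequence modelling for roughly two decades (translation, speech, handwriting), but two limits were structural, not bugs. First, the bottleneck: everything the model knows about the past must be squeezed into one fixed-size state vector. Second, sequential computation: step t cannot begin until step t−1 has finished, so training cannot be parallelised across time.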
Transformers solved both at once — attention reads any past position directly (no squeezing) and trains on all positions in parallel — at the price of O(T²) cost, which is the opening that state space models later attacked. The full three-way comparison lives in LLM vs RNN vs S4.
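To make "reads any past position directly" and the quadratic cost concrete, here is a minimal sketch of causal dot-product attention. It assumes no learned projections (queries, keys and values are all the raw inputs) and a single head, which real transformers do not do; `causal_attention` is a hypothetical name. The outer loop is written sequentially for readability, but each position's output depends only on the inputs, never on another position's output, so all positions could be computed in parallel.

```python
import math

def causal_attention(xs):
    """Unprojected, single-head causal attention over a list of vectors."""
    d = len(xs[0])
    outputs, n_scores = [], 0
    for t, q in enumerate(xs):
        past = xs[: t + 1]
        # Position t scores every position 0..t directly: no state to squeeze through.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in past]
        n_scores += len(scores)
        # Numerically stable softmax over the scores.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Output is a weighted average of the visible positions.
        outputs.append([sum(w * v[j] for w, v in zip(weights, past)) for j in range(d)])
    return outputs, n_scores

T = 8
xs = [[math.sin(t), math.cos(t)] for t in range(T)]
outputs, n_scores = causal_attention(xs)
print(n_scores, T * (T + 1) // 2)  # 36 36
```

The score count grows as T(T+1)/2, which is the O(T²) price mentioned above; a recurrence would do T state updates instead.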
The recurrent form never died: O(1) memory per step and O(T) total compute is exactly what streaming and edge inference want, which is why Mamba-style selective SSMs and hybrid models (see Griffin) are recurrences with better-behaved dynamics. And the LSTM's deepest idea — protect the gradient path with addition and gating — survives everywhere: ResNets, highway networks, and the residual stream of every transformer.
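The streaming property is easy to see in code. The sketch below uses a fixed scalar linear recurrence as a stand-in for a learned one (it is not Mamba, Griffin, or any particular SSM); `stream_recurrence`, `sensor` and the `decay` value are hypothetical. The generator keeps only the latest state, so it can consume an endless stream.

```python
import itertools
import math

def stream_recurrence(inputs, decay=0.9):
    """Yield one state per input while storing only the latest state."""
    h = 0.0
    for x in inputs:
        h = decay * h + (1.0 - decay) * x   # O(1) work and memory per step
        yield h

def sensor():
    """Hypothetical endless input stream."""
    for t in itertools.count():
        yield math.sin(0.05 * t)

# Take the first 1000 states from an unbounded stream; nothing is buffered
# inside the recurrence itself.
first_states = list(itertools.islice(stream_recurrence(sensor()), 1000))
print(len(first_states))  # 1000
```

An attention layer processing the same stream would need to keep every past position around (or a cache of them), which is the contrast the paragraph above draws.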