RNNs / LSTMs

Memory as a loop, and the gradient that would not survive it

01 · First principles: Sequences need a running summary

A sequence model must let the past influence the present: the meaning of a word depends on the sentence so far, the next sensor reading on the trajectory so far. Feeding a fixed window into a dense net fails twice — the window length caps the memory, and (as with images in CNNs) a pattern learned at position 5 teaches nothing about position 50.

The recurrent answer: keep a hidden state h, a running summary of everything seen, and update it with the same function at every step:

h_t = tanh(W_h h_{t−1} + W_x x_t + b),    y_t = g(h_t)

Weight sharing across time is the move — the convolution idea rotated ninety degrees. One transition function handles sequences of any length, and "what to remember" is learned rather than designed.
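The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `rnn_forward`, the dimensions, and the random initialisation are all assumptions chosen for the example. It shows that one set of weights handles a 5-step and a 50-step sequence alike.

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, b, h0=None):
    """Run h_t = tanh(W_h h_{t-1} + W_x x_t + b) over a sequence.

    xs has shape (T, input_dim); the result has shape (T, hidden_dim).
    """
    hidden_dim = W_h.shape[0]
    h = np.zeros(hidden_dim) if h0 is None else h0
    states = []
    for x_t in xs:  # the same W_h, W_x, b are reused at every time step
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return np.stack(states)

# Hypothetical sizes and weights, for illustration only.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)

short = rnn_forward(rng.normal(size=(5, input_dim)), W_h, W_x, b)
long = rnn_forward(rng.normal(size=(50, input_dim)), W_h, W_x, b)
print(short.shape, long.shape)  # (5, 4) (50, 4)
```

The readout y_t = g(h_t) is left out; it would be any function applied to each row of the returned states.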

02 · The breakage: The same Jacobian, T times

Training unrolls the loop into a T-layer feedforward net (one layer per time step, all sharing weights) and backpropagates through it — backpropagation through time. Here the shared weights turn poisonous. The gradient of a loss at step T with respect to the state at step t passes through every intermediate transition:

∂h_T/∂h_t = ∏_{s=t+1}^{T} ∂h_s/∂h_{s−1} ≈ (W_h)^{T−t} · (diagonal tanh′ terms)

Here (W_h)^{T−t} is the same matrix multiplied T−t times (a matrix power, not a transpose).

A deep feedforward net multiplies different Jacobians, whose effects can partially cancel. An RNN multiplies essentially the same one, so the product is governed, to a first approximation, by the spectral radius of W_h. Below 1 the gradient shrinks geometrically (vanishing): in practice a plain RNN struggles to learn dependencies more than roughly 10–20 steps long, because the teaching signal from the future arrives microscopically small. Above 1 it can blow up (exploding), which gradient clipping largely fixes, and that is why exploding is the lesser disease. Full treatment in exploding/vanishing gradients.
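The geometric shrinkage and growth can be checked numerically. The sketch below makes two simplifying assumptions that are not in the text: the tanh′ terms are set to 1 (the most favourable case for gradient survival), and W_h is a scaled orthogonal matrix, so its spectral radius and all its singular values equal `radius` exactly. The function name `backprop_norms` is hypothetical.

```python
import numpy as np

def backprop_norms(radius, steps=50, hidden_dim=16, seed=0):
    """Norm of the gradient after each backward step through a linearised RNN."""
    rng = np.random.default_rng(seed)
    # Orthogonal Q from a QR decomposition, scaled so the spectral radius is `radius`.
    Q, _ = np.linalg.qr(rng.normal(size=(hidden_dim, hidden_dim)))
    W_h = radius * Q
    grad = np.ones(hidden_dim) / np.sqrt(hidden_dim)  # unit-norm gradient at step T
    norms = []
    for _ in range(steps):
        grad = W_h.T @ grad  # one step back in time, with tanh' assumed to be 1
        norms.append(float(np.linalg.norm(grad)))
    return norms

for radius in (0.9, 1.0, 1.1):
    final = backprop_norms(radius)[-1]
    print(f"radius={radius}: gradient norm after 50 steps = {final:.4g}")
```

Under these assumptions the norm after 50 steps is exactly radius^50: about 0.005 for 0.9, 1 for 1.0, and about 117 for 1.1. Real tanh′ terms are at most 1, so they only push the product further towards vanishing.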

The diagnosis in one line: information must travel from step t to step T through T−t multiplications, and repeated multiplication by anything not exactly 1 destroys it.
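For the exploding case, the clipping remedy mentioned above is usually a rescaling by global norm. The following is a minimal sketch of that idea, assuming gradients arrive as a list of NumPy arrays; the helper name `clip_by_global_norm` and the threshold of 5.0 are illustrative choices, not values from the text.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    # Scale is 1 when the norm is already small enough; the direction never changes.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

# Hypothetical exploded gradients for two parameter tensors.
grads = [np.full((2, 2), 30.0), np.full(2, 40.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
print(f"before: {norm_before:.2f}, after: {norm_after:.2f}")
```

Clipping bounds the step size but cannot restore a signal that has already shrunk to nothing, which is why it helps with exploding gradients and not with vanishing ones.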

03 · The fix: LSTM: route memory through addition

If multiplication along the time path is the disease, the cure is to give memory a path that is additive. The LSTM adds a second state — the cell state c_t, a conveyor belt running through time — updated by addition, with three learned sigmoid gates (each a function of h_t−1 and x_t, outputting values in [0,1]) deciding what gets on and off the belt:

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t,    h_t = o_t ⊙ tanh(c_t)

f_t: forget gate, i_t: input gate, o_t: output gate, c̃_t: candidate values (a tanh layer over h_{t−1} and x_t)

Now ∂c_t/∂c_{t−1} ≈ f_t (elementwise, ignoring the gates' own indirect dependence on c_{t−1} through h_{t−1}), a learned value the network can hold near 1 for as long as a memory matters. Gradients flow back along the cell state through + and a gentle elementwise gate, not through a repeated matrix multiply. This is the same additive-highway insight as ResNet skip connections, discovered for time before it was discovered for depth. The GRU is the lighter cousin: it merges cell and hidden state and uses two gates instead of three, which means fewer parameters, often faster training, and accuracy comparable to LSTMs on many tasks.
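A short worked comparison makes the difference concrete. The numbers are illustrative assumptions: a forget gate held at a constant 0.99, set against a per-step shrink factor of 0.9 on the plain RNN path, both over 100 steps.

```latex
\begin{align*}
\frac{\partial c_T}{\partial c_t}
  &\approx \prod_{s=t+1}^{T} \operatorname{diag}(f_s)
  && \text{(cell-state path)} \\
f_s \equiv 0.99,\; T - t = 100:\quad
  & 0.99^{100} \approx 0.37
  && \text{(signal largely intact)} \\
\text{per-step factor } 0.9,\; T - t = 100:\quad
  & 0.9^{100} \approx 2.7 \times 10^{-5}
  && \text{(signal effectively gone)}
\end{align*}
```

The LSTM does not escape multiplication entirely; it replaces a fixed matrix with a per-element factor the network can learn to keep close to 1.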

Figure: the LSTM cell state drawn as a belt. Memory rides the belt and is edited by × f and + i. Gradients flowing right-to-left along the belt meet addition, not repeated matrix multiplication.
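One LSTM step, following the cell and hidden-state equations above, might look like the sketch below. Several details are assumptions made for the example: the four weight blocks are stacked into a single matrix `W` acting on the concatenation of h_{t−1} and x_t, the candidate c̃_t is a tanh layer, and the names `lstm_step`, `W`, and `b` are hypothetical. The usage part saturates the forget gate near 1 and the input gate near 0 through their biases, to show the cell state surviving 100 steps almost untouched.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*H, H+D): blocks for f, i, o and the candidate."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0:H])                # forget gate: how much old memory to keep
    i_t = sigmoid(z[H:2 * H])            # input gate: how much new content to write
    o_t = sigmoid(z[2 * H:3 * H])        # output gate: how much memory to expose
    c_tilde = np.tanh(z[3 * H:4 * H])    # candidate values
    c_t = f_t * c_prev + i_t * c_tilde   # additive update of the cell state
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Illustrative setup: small random weights, gate biases chosen to hold memory.
rng = np.random.default_rng(0)
D, H = 3, 4
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)
b[0:H] = 20.0        # forget gate pushed near 1
b[H:2 * H] = -20.0   # input gate pushed near 0

c0 = np.array([1.0, -2.0, 0.5, 3.0])
h, c = np.zeros(H), c0.copy()
for x_t in rng.normal(size=(100, D)):
    h, c = lstm_step(x_t, h, c, W, b)
print(np.max(np.abs(c - c0)))  # very small: the memory was carried through
```

In a trained network the gates are of course learned rather than pinned by hand; the pinned biases only isolate the mechanism.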

04 · The ceiling: What gates could not fix

LSTMs date from 1997 and were the dominant sequence model (translation, speech, handwriting) in the years before transformers arrived in 2017, but two limits were structural, not bugs:

Sequential training. h_t cannot be computed before h_t−1, so training cannot parallelise across the sequence length. GPUs are parallel machines; an architecture that must run step-by-step leaves most of the hardware idle, and at internet scale that is fatal.
Finite, lossy memory. The entire past is squeezed into one fixed-size vector. However well the gates triage, a long document does not fit; retrieving a detail from a thousand steps ago means it must have survived a thousand rounds of gate decisions made before its relevance was known.

Transformers solved both at once — attention reads any past position directly (no squeezing) and trains on all positions in parallel — at the price of O(T²) cost, which is the opening that state space models later attacked. The full three-way comparison lives in LLM vs RNN vs S4.
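The contrast with recurrence can be seen in a bare-bones causal self-attention layer. This is a single-head sketch under simplifying assumptions (no batching, no positional encoding, hypothetical names `causal_self_attention`, `W_q`, `W_k`, `W_v`), not a full transformer block. There is no loop over time, every position reads every earlier position directly, and the T × T score matrix is where the quadratic cost comes from.

```python
import numpy as np

def causal_self_attention(X, W_q, W_k, W_v):
    """Single-head causal attention over X of shape (T, d). Returns (outputs, weights)."""
    T = X.shape[0]
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])            # (T, T): the O(T^2) term
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)        # position t may not see s > t
    scores -= scores.max(axis=1, keepdims=True)       # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V, weights

# Illustrative sizes and random projections.
rng = np.random.default_rng(0)
T, d = 6, 4
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(scale=0.5, size=(d, d)) for _ in range(3))

out, weights = causal_self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (6, 4) (6, 6)
```

All T rows are computed in one batch of matrix products, which is what lets training parallelise across the sequence, while the weights matrix grows with the square of the sequence length.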

05 · Standing: What survives

The recurrent form never died: O(1) memory per step and O(T) total compute is exactly what streaming and edge inference want, which is why Mamba-style selective SSMs and hybrid models (see Griffin) are recurrences with better-behaved dynamics. And the LSTM's deepest idea — protect the gradient path with addition and gating — survives everywhere: ResNets, highway networks, and the residual stream of every transformer.

Mental Model

An RNN is one transition function h_t = f(h_t−1, x_t) reused at every step: weight sharing across time.
BPTT multiplies essentially the same Jacobian T−t times; a spectral radius below 1 makes the gradient vanish, above 1 lets it explode (clip for the latter, redesign for the former).
LSTM = an additive cell-state highway with learned forget/input/output gates; gradients flow through + instead of ×.
GRU: same idea, two gates, one state, cheaper.
The unfixable limits — sequential training and a fixed-size memory — are what transformers, and later SSMs, were built to remove.