RNNs / LSTMs

Memory as a loop, and the gradient that would not survive it

01 · First principles · Sequences need a running summary

A sequence model must let the past influence the present: the meaning of a word depends on the sentence so far, the next sensor reading on the trajectory so far. Feeding a fixed window into a dense net fails twice — the window length caps the memory, and (as with images in CNNs) a pattern learned at position 5 teaches nothing about position 50.

The recurrent answer: keep a hidden state h, a running summary of everything seen, and update it with the same function at every step:

$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b), \qquad y_t = g(h_t)$$

Weight sharing across time is the move — the convolution idea rotated ninety degrees. One transition function handles sequences of any length, and "what to remember" is learned rather than designed.
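The recurrence can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the helper names (`matvec`, `rnn_step`, `run_rnn`) and all parameter values are made up for the example, and the readout g is omitted. The point to notice is that one `rnn_step` serves a sequence of length 2 and a sequence of length 50 without any change.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def rnn_step(h_prev, x, W_h, W_x, b):
    """One transition: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    rec = matvec(W_h, h_prev)
    inp = matvec(W_x, x)
    return [math.tanh(r + i + b_k) for r, i, b_k in zip(rec, inp, b)]

def run_rnn(xs, h0, W_h, W_x, b):
    """Apply the same rnn_step to every element of xs; return all hidden states."""
    h, states = h0, []
    for x in xs:
        h = rnn_step(h, x, W_h, W_x, b)
        states.append(h)
    return states

# Illustrative (made-up) parameters: 2 hidden units, 1 input feature.
W_h = [[0.5, -0.3], [0.2, 0.4]]
W_x = [[1.0], [-1.0]]
b = [0.0, 0.1]
h0 = [0.0, 0.0]

short = run_rnn([[1.0], [0.5]], h0, W_h, W_x, b)
long = run_rnn([[1.0], [0.5]] + [[0.0]] * 48, h0, W_h, W_x, b)
print(len(short), len(long))  # 2 50
```

Because the weights are shared, the first two states of the long run are identical to the short run: the hidden state depends only on the inputs seen so far.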

02 · The breakage · The same Jacobian, T times

Training unrolls the loop into a T-layer feedforward net (one layer per time step, all sharing weights) and backpropagates through it — backpropagation through time. Here the shared weights turn poisonous. The gradient of a loss at step T with respect to the state at step t passes through every intermediate transition:

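$$\frac{\partial h_T}{\partial h_t} = \prod_{s=t+1}^{T} \frac{\partial h_s}{\partial h_{s-1}} = \prod_{s=t+1}^{T} \operatorname{diag}\big(\tanh'(a_s)\big)\, W_h \;\approx\; W_h^{\,T-t} \cdot (\text{diagonal } \tanh' \text{ terms})$$
Here $a_s = W_h h_{s-1} + W_x x_s + b$ is the pre-activation at step $s$; the same matrix $W_h$ is multiplied in $T-t$ times.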

A deep feedforward net multiplies different Jacobians, whose effects can partially cancel. An RNN multiplies essentially the same one, so the product is governed, to a first approximation, by the spectral radius of $W_h$, with the tanh′ factors (never larger than 1) adding further damping. Below 1 the gradient shrinks geometrically: this is vanishing, and in practice the network struggles to learn dependencies more than ~10–20 steps long, because the teaching signal from the future arrives microscopically small. Above 1 it can blow up: this is exploding, which is fixable by gradient clipping and is therefore the lesser disease. Full treatment in exploding/vanishing gradients.
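Gradient clipping, as mentioned above, is commonly done by rescaling the gradient whenever its norm exceeds a threshold. The sketch below shows that one variant (clipping by L2 norm); the function name `clip_by_norm` and the threshold of 5.0 are assumptions chosen for illustration, not values from this text.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; its direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

exploded = [300.0, -400.0]             # L2 norm is 500
clipped = clip_by_norm(exploded, 5.0)  # hypothetical threshold
small = clip_by_norm([0.3, -0.4], 5.0) # already within the threshold: returned as is
print(clipped, small)
```

Clipping caps the step size but keeps the direction, which is why it handles explosion well. There is no equivalent trick for vanishing: a gradient that has shrunk to nearly zero carries no direction worth rescaling.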

The diagnosis in one line: information must travel from step t to step T through T−t multiplications, and repeated multiplication by anything not exactly 1 destroys it.
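The effect is easy to see numerically in the scalar case, where the Jacobian product collapses to a product of numbers. The sketch below assumes a one-unit RNN, h_t = tanh(w · h_{t−1} + x_t), with made-up weights and all-zero inputs; `grad_through_time` is a hypothetical helper that accumulates the chain-rule factors w · tanh′(a_s).

```python
import math

def grad_through_time(w, xs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in xs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)   # chain-rule factor: w * tanh'(a), with tanh'(a) = 1 - tanh(a)^2
    return grad

steps = 50
xs = [0.0] * steps

for w in (0.5, 0.9):
    g = grad_through_time(w, xs, h0=0.1)
    print(f"w={w}: |dh_T/dh_0| = {abs(g):.3e}, upper bound w^T = {w ** steps:.3e}")

# Without the tanh damping, the factor is exactly w^T in either direction.
print(f"linear case: 0.9^50 = {0.9 ** 50:.3e}, 1.1^50 = {1.1 ** 50:.3e}")
```

Fifty steps are enough to shrink a factor of 0.9 to well under one percent, or to grow a factor of 1.1 past one hundred. Only a factor of exactly 1 passes through unchanged.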

03 · The fix · LSTM: route memory through addition

If multiplication along the time path is the disease, the cure is to give memory a path that is additive. The LSTM adds a second state, the cell state $c_t$, a conveyor belt running through time. It is updated by addition, with three learned sigmoid gates (each a function of $h_{t-1}$ and $x_t$, outputting values in [0,1]) deciding what gets on and off the belt:

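$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)$$
where $f_t$ is the forget gate, $i_t$ the input gate, $o_t$ the output gate, and $\tilde{c}_t$ the candidate update proposed at step $t$.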

Now $\partial c_t / \partial c_{t-1} \approx f_t$, a learned value the network can hold near 1 for as long as a memory matters. Gradients flow back along the cell state through addition and a gentle elementwise gate, not through a repeated matrix multiply. This is the same additive-highway insight as ResNet skip connections, discovered for time before it was discovered for depth. The GRU is the lighter cousin: it merges cell and hidden state and uses two gates instead of three, which often trains faster and performs comparably to LSTMs on many tasks.
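A single LSTM step can be sketched as follows, assuming the common formulation in which each gate is a sigmoid of an affine map of the concatenated [h_{t−1}, x_t] and the candidate is a tanh of the same kind of map (no peephole connections). The parameter layout, the names `affine` and `lstm_step`, and every numeric value are illustrative assumptions. The forget-gate bias is set high on purpose so that f starts near 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Return W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params maps 'f', 'i', 'o', 'c' to (W, b) acting on [h_prev, x]."""
    z = h_prev + x                      # list concatenation: [h_{t-1}, x_t]

    def layer(name, act):
        W, b = params[name]
        return [act(a) for a in affine(W, z, b)]

    f = layer("f", sigmoid)             # forget gate
    i = layer("i", sigmoid)             # input gate
    o = layer("o", sigmoid)             # output gate
    c_tilde = layer("c", math.tanh)     # candidate update
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, c_tilde)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c, f                      # f is returned only so it can be inspected

# Made-up parameters: 2 hidden units, 1 input feature, so each W is 2 x 3.
params = {
    "f": ([[0.1, 0.0, 0.2], [0.0, 0.1, -0.2]], [4.0, 4.0]),
    "i": ([[0.3, -0.1, 0.5], [0.2, 0.1, -0.4]], [0.0, 0.0]),
    "o": ([[0.1, 0.2, 0.3], [-0.2, 0.1, 0.1]], [0.0, 0.0]),
    "c": ([[0.4, -0.3, 0.6], [0.1, 0.5, -0.7]], [0.0, 0.0]),
}
x, h_prev, c_prev = [1.0], [0.2, -0.1], [0.5, -0.5]
h, c, f = lstm_step(x, h_prev, c_prev, params)

# Nudge one entry of c_{t-1} and measure how much of the nudge reaches c_t.
eps = 1e-3
_, c_bumped, _ = lstm_step(x, h_prev, [c_prev[0] + eps, c_prev[1]], params)
slope = (c_bumped[0] - c[0]) / eps
print(f"f[0] = {f[0]:.4f}, measured dc_t/dc_(t-1) = {slope:.4f}")
```

With h_{t−1} held fixed, the measured slope equals the forget gate: along the direct cell-state path the only thing a gradient meets is an elementwise multiplication by f. The "≈" in the text covers the indirect paths, since over several steps the gates themselves depend on earlier cell states through h.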

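[Figure: Cell state c, the additive highway. c_{t−1} flows to c_t through a multiplication (×, forget gate f) and an addition (+, input gate i); the output gate o produces h_t from c_t. All gates are computed from h_{t−1} and x_t, with values in [0,1].]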

Memory rides the green belt and is edited by × f and + i. Gradients flowing right-to-left along the belt meet addition, not repeated matrix multiplication.

04 · The ceiling · What gates could not fix

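Introduced in 1997, LSTMs were the workhorse of sequence modelling (translation, speech, handwriting) until transformers arrived in 2017, but two limits were structural, not bugs: the entire past must be squeezed into one fixed-size state vector, and step t cannot be computed until step t−1 has finished, so training cannot be parallelised across time.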

Transformers solved both at once — attention reads any past position directly (no squeezing) and trains on all positions in parallel — at the price of O(T²) cost, which is the opening that state space models later attacked. The full three-way comparison lives in LLM vs RNN vs S4.
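To make the contrast concrete, here is a toy single-head causal attention in plain Python. It is a simplified sketch: a real transformer applies learned projections to obtain queries, keys, and values, whereas this example passes the raw inputs for all three, and `causal_attention` and the toy data are made up for illustration. Each output reads every earlier position directly, and the number of scores computed grows as T(T+1)/2, which is the O(T²) cost.

```python
import math

def causal_attention(queries, keys, values):
    """Scaled dot-product attention where position t attends to positions 0..t."""
    d = len(queries[0])
    outputs, n_scores = [], 0
    # Each iteration uses only the inputs, never an earlier output,
    # so all positions could be computed in parallel.
    for t, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        n_scores += len(scores)
        m = max(scores)                              # subtract max for numerical stability
        exps = [math.exp(sc - m) for sc in scores]
        total = sum(exps)
        weights = [e / total for e in exps]          # softmax over the visible past
        out = [sum(w * values[s][k] for s, w in enumerate(weights))
               for k in range(len(values[0]))]
        outputs.append(out)
    return outputs, n_scores

# Toy inputs (made up): T = 4 positions, dimension 2.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, n_scores = causal_attention(xs, xs, xs)
T = len(xs)
print(n_scores, T * (T + 1) // 2)  # 10 10
```

Unlike the recurrent loop, nothing here is carried from one position to the next: the memory is the whole list of keys and values, which is why it grows with T instead of staying fixed.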

05 · Standing · What survives

The recurrent form never died: O(1) memory per step and O(T) total compute is exactly what streaming and edge inference want, which is why Mamba-style selective SSMs and hybrid models (see Griffin) are recurrences with better-behaved dynamics. And the LSTM's deepest idea — protect the gradient path with addition and gating — survives everywhere: ResNets, highway networks, and the residual stream of every transformer.

Mental Model