8. Limitations — why LSTMs eventually had to go
The LSTM was introduced in 1997 and, for most of the 2010s, was the default choice for sequence modelling. From 2017 onward the transformer displaced it. To understand why, and why the transformer felt inevitable, we need to look honestly at what LSTMs could never do well.
Limitation 1 — they are strictly sequential
To compute the hidden state at step t, the LSTM needs the hidden state from step t−1. Which needs t−2. Which needs t−3. And so on.
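This means training cannot be parallelised across time steps. A GPU can still parallelise the arithmetic inside one step and across the sequences in a batch, but the LSTM has to process step 1 before step 2 before step 3. On long sequences, much of the GPU's thousands of cores ends up idle, waiting.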
Indian-life analogy
Imagine an office with 100 clerks who all want to help, but the filing system requires that file t be signed by the clerk who signed file t−1. Every clerk except one is sitting around doing nothing. Throw more clerks at the problem and throughput doesn’t improve.
This is why LSTMs, for all their sophistication, had a hard ceiling on training speed. Doubling your GPU budget didn’t double your training speed. It only marginally helped.
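The dependency chain described above can be seen in a minimal NumPy sketch of an LSTM forward pass. This is an illustration only, not a reference implementation: the names `lstm_step` and `run_lstm`, the gate ordering inside `W`, and the sizes are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W: (4*hidden, hidden + inputs), b: (4*hidden,)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0 * hidden:1 * hidden])   # forget gate
    i = sigmoid(z[1 * hidden:2 * hidden])   # input gate
    o = sigmoid(z[2 * hidden:3 * hidden])   # output gate
    g = np.tanh(z[3 * hidden:4 * hidden])   # candidate cell update
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t

def run_lstm(xs, W, b, hidden):
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    states = []
    # Each iteration needs h and c from the previous one, so the
    # time steps have to run one after another.
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, W, b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
inputs, hidden, steps = 8, 16, 50
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + inputs))
b = np.zeros(4 * hidden)
xs = rng.normal(size=(steps, inputs))

states = run_lstm(xs, W, b, hidden)
print(states.shape)  # (50, 16)
```

The matrix product inside `lstm_step` parallelises well on a GPU. The `for` loop around it does not, and that loop is as long as the sequence.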
Limitation 2 — the bottleneck of the hidden state
Every sequence, no matter how long, has to be compressed into a single fixed-size hidden state (and cell state). A 20-word sentence and a 2,000-word essay end up squeezed into the same 512-dimensional vector.
The longer the input, the more the network has to “forget” to make room. This showed up most obviously in machine translation: LSTMs handled 15-word sentences beautifully but struggled with 50-word ones, and fell apart on paragraphs.
The 2014 Bahdanau attention paper (Paper 07) was the first crack in this bottleneck — a way for the decoder to look back at all encoder hidden states instead of relying on the final one. Attention was invented as a patch for the LSTM’s information bottleneck. Three years later, the transformer would realise: if attention is doing most of the work, why bother with the LSTM at all?
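A rough sketch of the difference between passing on only the final state and attending over all encoder states follows. Two assumptions to flag: the encoder states here are random stand-ins rather than real LSTM outputs, and the scoring uses a scaled dot product for brevity, whereas the Bahdanau paper used a small learned network (additive attention). The function name `attend` is made up for this example.

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max()   # subtract the max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum()

def attend(query, encoder_states):
    """Weighted average of all encoder states (dot-product scoring, an assumption)."""
    d = query.shape[0]
    scores = encoder_states @ query / np.sqrt(d)   # one score per source position
    weights = softmax(scores)                      # shape (T,), sums to 1
    context = weights @ encoder_states             # shape (d,)
    return context, weights

rng = np.random.default_rng(0)
d = 512
for T in (20, 2000):
    encoder_states = rng.normal(size=(T, d))   # stand-in for per-step hidden states
    final_state = encoder_states[-1]           # all a plain encoder-decoder passes on
    query = rng.normal(size=d)                 # stand-in for the decoder's current state
    context, weights = attend(query, encoder_states)
    print(T, final_state.shape, context.shape, weights.shape)
```

In both cases the final state is a single 512-dimensional vector, whatever the input length. The attention weights, by contrast, have one entry per source position, so the decoder can draw on any of the 20 or 2,000 states at each output step.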
Limitation 3 — still-limited effective range
Even with LSTMs, gradients don’t travel forever. In practice, an LSTM can reliably carry information across roughly 200 to 500 time steps, with the exact figure depending on the task. That’s enough for a sentence, a paragraph, maybe a short document. It is not enough for an entire book, a day of stock ticks, or the long-range dependencies in DNA.
LSTMs push the ceiling from “5 steps” to “500 steps”. A big win. But 500 is still not ∞, and many real problems needed more.
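One way to get a feel for why the range is finite: in the standard cell update c_t = f_t * c_{t-1} + i_t * g_t, the direct contribution of an old cell value is scaled by the forget gate at every step. The sketch below is a deliberate simplification. It assumes a constant forget-gate value and looks only at this direct path, ignoring everything that flows through the gates themselves, so it illustrates the trend rather than deriving the step counts quoted above.

```python
def surviving_fraction(forget_gate, steps):
    """Fraction of an old cell value left after `steps` updates,
    assuming the forget gate holds the same value at every step."""
    return forget_gate ** steps

for f in (0.9, 0.99, 0.999):
    for steps in (50, 500, 5000):
        fraction = surviving_fraction(f, steps)
        print(f"forget gate {f}, {steps} steps: {fraction:.2e}")
```

Even a forget gate that keeps 99% per step leaves under 1% of the original signal after 500 steps. The gate has to sit extremely close to 1 to carry anything over thousands of steps.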
Limitation 4 — they are hard to interpret
The cell state is a vector of real numbers. Nobody can point at slot 47 and say “that’s where the subject of the sentence is stored”. The information is entangled across slots in ways we cannot easily inspect.
Transformers, as we’ll see later, are also not trivially interpretable, but their attention weights at least give a rough view of which input positions a prediction draws on. LSTMs offer no comparable view. For safety-critical systems (medical, legal, financial), this opacity was a constant source of anxiety.
Limitation 5 — gate wastage
The LSTM’s three gates (forget, input, output) and its candidate update sound elegant, but each of the four is learned separately with its own set of weights. On many tasks, researchers found that some of this machinery was redundant. The GRU (Gated Recurrent Unit, Cho et al. 2014) showed that you could merge the forget and input gates into a single update gate and fold the cell state into the hidden state, and performance on many tasks barely changed. This was a hint that the LSTM architecture was over-engineered for what it actually did.
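The cost of the extra machinery can be put in numbers. An LSTM cell learns four sets of weights (three gates plus the candidate update), while a GRU learns three (update gate, reset gate, candidate state). The sketch below counts parameters for a single layer, assuming one weight matrix and one bias vector per set. Some frameworks store two bias vectors per set, so their reported counts are slightly higher. The layer sizes are arbitrary examples.

```python
def lstm_params(inputs, hidden):
    # forget, input and output gates plus the candidate update:
    # four weight matrices of shape (hidden, hidden + inputs) and four biases
    return 4 * (hidden * (hidden + inputs) + hidden)

def gru_params(inputs, hidden):
    # update gate, reset gate and candidate state:
    # three weight matrices of shape (hidden, hidden + inputs) and three biases
    return 3 * (hidden * (hidden + inputs) + hidden)

inputs, hidden = 256, 512
print("LSTM:", lstm_params(inputs, hidden))  # LSTM: 1574912
print("GRU: ", gru_params(inputs, hidden))   # GRU:  1181184
```

Under these assumptions a GRU layer has exactly three quarters of the parameters of an LSTM layer of the same size.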
Limitation 6 — it never truly scaled
The great discovery of the transformer era was that models get dramatically better as you scale them up: more parameters, more data, more compute. LSTMs did not obey this rule cleanly. Beyond a few hundred million parameters, they hit diminishing returns — partly because you couldn’t train them fast enough, partly because the bottleneck (limitation 2) got worse with scale.
Without scaling, no GPT. Without GPT, no modern AI boom. The future belonged to whatever architecture could scale — and it wasn’t the LSTM.
Putting the limitations together
The LSTM was a brilliantly designed memory for a world where:
- Sequences were short (under 500 steps).
- You had one GPU, not thousands.
- Models had millions of parameters, not billions.
When all three of those assumptions broke at roughly the same time (around 2017), the LSTM ran out of road. Its replacement, the transformer, throws away recurrence entirely and uses attention to connect every position with every other position in parallel. That is Paper 08.
A fair final word
It is tempting to talk about LSTMs as “the old thing we replaced”. But they are still the right tool in several corners of AI:
- Tiny models on embedded devices. Transformers are expensive; a small LSTM running on an Arduino or ESP32 is often a better fit for keyword spotting or predictive text on a smart watch.
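- Streaming data where latency matters. An LSTM produces one output per step at constant cost, carrying only its fixed-size state forward. A transformer has to attend over a window of past inputs at every step, so its per-step cost and memory grow with the context it keeps.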
- Problems with strong sequential causality. Control systems, physical simulations, certain kinds of financial data — places where the answer really does only depend on a short recent past.
LSTMs did not disappear. They moved from centre stage to supporting role, which is roughly what happens to every successful idea in research: it becomes so obvious that people stop naming it, and it keeps quietly doing its job.
Next: what came after LSTMs.