Paper 04
Intermediate

Long Short-Term Memory

Long Short-Term Memory (LSTM) — Hochreiter & Schmidhuber, 1997

TL;DR

Backpropagation (Paper 03) gave us a way to train deep networks. But when those networks had to read a sequence (a sentence, a song, a stock ticker), the gradient shrank at every step as it travelled back through time, because each step multiplies it by another factor that is usually smaller than one. In practice a plain RNN could only learn from the last handful of steps. Anything older was effectively lost.
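To make the shrinking concrete, here is a minimal sketch using a one-unit RNN, h_t = tanh(w * h_(t-1) + x_t). The function name, the weight 0.9 and the constant input 0.5 are illustrative assumptions, not values from the paper. The running product it returns is the factor a gradient picks up when it travels back that many steps.

```python
import math


def gradient_factors(w, inputs):
    """Return d h_t / d h_0 for t = 1..len(inputs) in a one-unit tanh RNN.

    Each step contributes the local derivative w * (1 - h_t**2), so the
    sensitivity to the starting state is the running product of those terms.
    """
    h = 0.0
    running = 1.0
    products = []
    for x in inputs:
        h = math.tanh(w * h + x)
        running *= w * (1.0 - h * h)  # chain rule: one more factor per step
        products.append(running)
    return products


if __name__ == "__main__":
    # Assumed toy values: recurrent weight 0.9, constant input 0.5.
    products = gradient_factors(0.9, [0.5] * 50)
    for steps in (1, 5, 10, 50):
        print(f"{steps:>2} steps back: gradient scale = {products[steps - 1]:.3e}")
```

Because every factor here is below 0.9 in magnitude, the product falls off at least geometrically with the number of steps, which is the vanishing gradient in miniature.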

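Hochreiter and Schmidhuber proposed a radical fix: don’t just pass a hidden state forward, also pass a protected memory line called the cell state. In the now-standard design, three small neural-network gates (forget, input, output) decide what to erase, what to write, and what to read from this memory; the 1997 paper introduced the input and output gates, and the forget gate was added in a 2000 follow-up by Gers, Schmidhuber and Cummins. Because the cell state flows almost unchanged from step to step, the gradient can survive hundreds of time steps instead of a handful. LSTMs went on to power Google’s neural machine translation system (2016), speech recognition in voice assistants such as Siri, and many of the strongest sequence models until Transformers arrived in 2017.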
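The gate mechanism described above can be sketched for a single scalar unit in plain Python. This follows the commonly used LSTM formulation with a forget gate; the function names, the parameter dictionary and every number in it are hypothetical values chosen for illustration, not taken from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM unit. `p` maps names to weights and biases."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate: what to erase
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate: what to write
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate: what to read
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value to write
    c = f * c_prev + i * g   # cell state: keep a fraction of the old, add some new
    h = o * math.tanh(c)     # hidden state: a gated view of the cell state
    return h, c


if __name__ == "__main__":
    # Assumed parameters: forget gate held near 1, input gate held near 0,
    # so the unit should simply carry its memory forward.
    params = {
        "wf": 0.0, "uf": 0.0, "bf": 20.0,
        "wi": 0.0, "ui": 0.0, "bi": -20.0,
        "wo": 0.0, "uo": 0.0, "bo": 0.0,
        "wg": 0.0, "ug": 0.0, "bg": 0.0,
    }
    h, c = 0.0, 1.0
    for _ in range(100):
        h, c = lstm_step(x=0.3, h_prev=h, c_prev=c, p=params)
    print(f"cell state after 100 steps: {c:.6f}")
```

With the forget gate near 1 and the input gate near 0, the cell state is copied forward almost unchanged, and d c_t / d c_(t-1) is close to 1. That is why the gradient along the cell state does not shrink the way it does in a plain RNN.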

The journey in one line

Deep networks could see → but they couldn’t remember → LSTMs gave them a notebook they could selectively update.

What you will learn

  1. Why a plain RNN forgets — the vanishing gradient, told in pictures.
  2. Why the XOR problem (Paper 02) foreshadowed this failure on sequences.
  3. The cell state — a student’s running notes for the neural network.
  4. The three gates — forget, input, output — and what each one decides.
  5. A worked numerical example: one LSTM step by hand.
  6. A 25-line PyTorch LSTM cell you can run on Google Colab.
  7. Why LSTMs dominated sequence modelling until 2017, and why Transformers eventually replaced them.

Sections

  1. Historical context — 1997, winter of neural nets, RNNs born and broken
  2. The problem — vanishing gradients and the XOR echo
  3. The core idea — a protected memory line + three gates
  4. How it works — one LSTM step, drawn out
  5. The math — equations + a fully worked numeric example
  6. The code — a minimal LSTM cell in PyTorch
  7. Impact — Google Translate, Alexa, DeepMind
  8. Limitations — why it couldn’t scale to GPT-size
  9. What came next — word embeddings, seq2seq, attention

Resources

  • Glossary — every new term used in this paper
  • Quiz — 5 questions to test your understanding
  • Further reading — blogs, videos, original paper
