Further reading — Paper 04

If this paper hooked you, here is a curated reading list. All of it is free. Start with the blog posts, watch the videos, then try the original paper at the end — by the time you get there, everything will read smoothly.

The original paper

Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. PDF: https://www.bioinf.jku.at/publications/older/2604.pdf The paper is dense and its notation is heavier than necessary, but the core equations in Section 4 are exactly what we covered.
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TU München. The original vanishing-gradient analysis, in German. Hard to find a clean English translation, but the paper above summarises its results.

Blog posts (highly recommended)

Christopher Olah — “Understanding LSTM Networks” (2015). https://colah.github.io/posts/2015-08-Understanding-LSTMs/ The gold-standard visual explanation of LSTMs. The diagrams alone are worth the visit.
Andrej Karpathy — “The Unreasonable Effectiveness of Recurrent Neural Networks” (2015). https://karpathy.github.io/2015/05/21/rnn-effectiveness/ A fun read with character-level LSTMs generating Shakespeare, C code, and Wikipedia articles. Explains RNNs and LSTMs side by side.
Edwin Chen — “Exploring LSTMs” (2017). https://blog.echen.me/2017/05/30/exploring-lstms/ Visualises what each cell state dimension actually learns. Makes the whole architecture feel less magical.

Videos

3Blue1Brown — “Neural Networks” series, parts 1–4. https://www.3blue1brown.com/topics/neural-networks Not specifically about LSTMs, but covers gradients and backpropagation beautifully. Watch this before attempting the paper.
StatQuest with Josh Starmer — “Long Short-Term Memory (LSTM)”. https://www.youtube.com/watch?v=YCzL96nL7j0 Goofy but clear. Walks through an LSTM step by step with exactly the kind of worked example we built in Section 5.
Andrej Karpathy — “Let’s build GPT: from scratch, in code, spelled out”. https://www.youtube.com/watch?v=kCc8FmEb1nY Not about LSTMs, but the first 30 minutes explain character-level sequence modelling and the bigram model, which is excellent context for why sequences matter.

Code and tutorials

PyTorch official LSTM tutorial. https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html Trains an LSTM part-of-speech tagger. Good hands-on exercise once you’ve finished our code section.
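NumPy character-level RNN from scratch (gist by karpathy). https://gist.github.com/karpathy/d4dee566867f8291f086 Just NumPy, no PyTorch. Note that this gist (min-char-rnn.py) implements a vanilla RNN, not an LSTM. It still shows every matrix shape in the raw, which makes it a clean baseline to compare against our LSTM code.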
Google Colab notebook accompanying this page. (Coming — we’ll link it once the site is live.) Will contain the 25-line cell from Section 6, plus the three exercises, with expected outputs.
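If you want to check the matrix shapes before opening any of the links above, a single LSTM step can be sketched in a few lines of NumPy. This is a minimal sketch, not the code from Section 6 and not the implementation from any linked resource. It assumes the now-standard variant with a forget gate, and the names `lstm_step`, `W`, `b`, the gate ordering inside `W`, and the sizes 3 and 5 are all illustrative choices. The weights are random and untrained, so only the shapes are meaningful.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step (forget-gate variant, assumed here).

    x:      (input_size,)
    h_prev: (hidden_size,)   previous hidden state
    c_prev: (hidden_size,)   previous cell state
    W:      (4 * hidden_size, input_size + hidden_size)
    b:      (4 * hidden_size,)
    """
    n = h_prev.shape[0]
    # One matrix multiply produces the pre-activations for all four blocks.
    z = W @ np.concatenate([x, h_prev]) + b  # shape (4 * n,)
    i = sigmoid(z[0 * n:1 * n])  # input gate
    f = sigmoid(z[1 * n:2 * n])  # forget gate
    o = sigmoid(z[2 * n:3 * n])  # output gate
    g = np.tanh(z[3 * n:4 * n])  # candidate cell update
    c = f * c_prev + i * g       # new cell state
    h = o * np.tanh(c)           # new hidden state
    return h, c


# Illustrative sizes and random, untrained weights (assumptions).
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 5
W = rng.normal(scale=0.1, size=(4 * hidden_size, input_size + hidden_size))
b = np.zeros(4 * hidden_size)

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x in rng.normal(size=(4, input_size)):  # a sequence of 4 time steps
    h, c = lstm_step(x, h, c, W, b)

print(h.shape, c.shape)  # (5,) (5,)
```

Stacking the four gate blocks into one `W` is only a convenience; keeping four separate matrices gives the same result and can be easier to read against the equations.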

Papers to read before GPT-era stuff

If your goal is to reach Transformers (Paper 08) with clear intuition, read in this order:

1. This paper (LSTM): ✅ done.
2. Word2Vec (Paper 05): word embeddings as input to sequence models.
3. Seq2Seq (Paper 06): encoder-decoder LSTMs for translation.
4. Bahdanau Attention (Paper 07): the first attention mechanism, invented to patch the LSTM bottleneck.
5. Attention Is All You Need (Paper 08): the Transformer.

Indian resources and community

AI4Bharat (IIT Madras). https://ai4bharat.org Indian-language NLP research lab working on Hindi, Tamil, Bengali, and other Indian languages. Its machine-translation work is a good place to see the sequence-to-sequence ideas from this series applied to Indian languages.
NPTEL course on Deep Learning (IIT Madras). https://nptel.ac.in/courses/106106184 Free video lectures in English, taught by Prof. Mitesh Khapra. Covers LSTMs in Week 6.
NPTEL course on Deep Learning for Computer Vision (Prof. Vineeth Balasubramanian, IIT Hyderabad). https://nptel.ac.in/courses/106106224 Uses LSTMs in the video/sequence modelling sections.
