Paper 10
Intermediate

Improving Language Understanding by Generative Pre-Training

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever · OpenAI · 2018


What this paper did

It showed that a single pre-trained model, fine-tuned with minimal changes, could outperform purpose-built models across a wide range of language tasks.

Before GPT-1, the standard approach to NLP was: gather labelled data for your specific task (sentiment, question answering, textual entailment), design a task-specific architecture, and train it from scratch. This worked, but it required an expensive labelled dataset for every new task, and each model started with almost no prior knowledge of language, at most a set of pre-trained word embeddings.

Radford et al. took the decoder half of the Transformer and pre-trained it on BooksCorpus (roughly 800 million words) using a single objective: predict the next token. No labels are needed, because the supervision comes from the text itself. After pre-training, they fine-tuned the same model on comparatively small labelled datasets with one key constraint: keep the architecture essentially unchanged. Rather than redesign the network for each task, they transformed each task's input into a single token sequence that matches the pre-training format and added only a linear output layer.
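The two data-side ideas in this paragraph, next-token training pairs and input transformation, can be sketched in plain Python. This is a minimal illustration, not the authors' code: the token strings `<s>`, `$` and `<e>` and all function names are hypothetical placeholders for the start, delimiter and extract tokens, and the "tokens" are whole words rather than the subword units a real model would use.

```python
# Hypothetical special tokens (placeholders chosen for this sketch).
START, DELIM, EXTRACT = "<s>", "$", "<e>"

def to_classification_input(text_tokens):
    """Single-text tasks (e.g. sentiment): wrap the text in start/extract tokens."""
    return [START] + list(text_tokens) + [EXTRACT]

def to_entailment_input(premise_tokens, hypothesis_tokens):
    """Sentence-pair tasks: join premise and hypothesis with a delimiter token."""
    return ([START] + list(premise_tokens) + [DELIM]
            + list(hypothesis_tokens) + [EXTRACT])

def next_token_pairs(tokens, k):
    """Pre-training examples: (up to k previous tokens, the token to predict)."""
    return [(tokens[max(0, i - k):i], tokens[i]) for i in range(1, len(tokens))]

example = to_entailment_input(["a", "man", "sleeps"], ["a", "person", "rests"])
print(example)

pairs = next_token_pairs(["the", "cat", "sat", "down"], k=2)
print(pairs)
```

Because every task ends up as one flat token sequence, the same decoder can read it left to right exactly as it read book text during pre-training.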

The result beat state-of-the-art on 9 of 12 NLP benchmarks, including tasks the model was never explicitly designed for.

The key equations:

Pre-training loss:  L₁(U) = Σᵢ log P(uᵢ | uᵢ₋ₖ,...,uᵢ₋₁; Θ)

Fine-tuning loss:   L₂(C) = Σ log P(y | x¹,...,xᵐ)

Combined loss:      L₃(C) = L₂(C) + λ·L₁(C)

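Where U is the unlabelled text corpus, k is the size of the context window, Θ denotes the model parameters, and C is the labelled downstream dataset; the sum in L₂ runs over the labelled pairs (x, y) in C. Although they are called losses here, all three are log-likelihoods that training maximises. λ weights the auxiliary language modelling objective that stays active during fine-tuning (the paper sets λ = 0.5).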
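To make the three objectives concrete, here is a minimal NumPy sketch that evaluates them for a single example. It assumes the model has already produced logits; the function names and array shapes are choices made for this illustration, and zero-valued logits stand in for real model outputs.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def lm_log_likelihood(token_logits, token_ids):
    """L1 for one sequence: sum of log P(u_i | previous tokens).

    token_logits has shape (T, V); row i scores the token at position i + 1,
    so rows 0..T-2 are compared against token_ids[1:].
    """
    logp = log_softmax(token_logits[:-1])
    targets = token_ids[1:]
    return logp[np.arange(len(targets)), targets].sum()

def clf_log_likelihood(class_logits, label):
    """L2 for one labelled example: log P(y | x^1, ..., x^m)."""
    return log_softmax(class_logits)[label]

def combined_objective(token_logits, token_ids, class_logits, label, lam=0.5):
    """L3 = L2 + lam * L1 for one example (a quantity to maximise)."""
    return (clf_log_likelihood(class_logits, label)
            + lam * lm_log_likelihood(token_logits, token_ids))

# Toy stand-ins for model outputs: 4 tokens, vocabulary of 5, 2 classes.
token_logits = np.zeros((4, 5))
token_ids = np.array([0, 1, 2, 3])
class_logits = np.zeros(2)

l1 = lm_log_likelihood(token_logits, token_ids)
l2 = clf_log_likelihood(class_logits, label=1)
l3 = combined_objective(token_logits, token_ids, class_logits, label=1)
print(l1, l2, l3)
```

With all-zero logits every prediction is uniform, so L1 is 3·log(1/5) and L2 is log(1/2), which is a quick sanity check on the indexing. Most training code minimises the negative of these quantities.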


The Indian analogy

Consider a student who, before the Board exams, spent three years reading every novel, newspaper, science magazine, and history book they could find. They never crammed any specific exam syllabus — they just read broadly and deeply.

Now, one month before the exam, they spend a week on each subject’s past papers (fine-tuning). Because they already understand how arguments are constructed (language), how stories develop (reasoning), and how facts relate (knowledge), they need very few practice examples to ace each specific test.

Contrast this with a classmate who started studying only when the syllabus was announced, with no prior reading. That classmate needs months of subject-specific coaching and still knows only what was explicitly taught.

GPT-1’s pre-training is the three years of broad reading. Fine-tuning is the one-month sprint. The pre-trained model begins each task with a head start that a model trained from scratch lacks, because language understanding transfers across tasks.


Read in this order

| Section | What you will learn | Difficulty | Time |
|---|---|---|---|
| 1. Context | NLP in 2018: the labelled data bottleneck | 🟢 | 4 min |
| 2. The Problem | Why task-specific models fail to generalise | 🟢 | 3 min |
| 3. The Idea | Pre-train on books, fine-tune on tasks, no architecture changes | 🟡 | 5 min |
| 4. The Math | Autoregressive LM objective, fine-tuning loss, input transformations | 🔴 | 10 min |
| 5. Worked Example | Forward pass through GPT-1 on a sentiment classification task | 🔴 | 8 min |
| 6. The Code | Causal language model in NumPy; input transformation for classification | 🟡 | 6 min |
| 7. Limitations | Unidirectional context, no instruction following, fine-tuning still needs labels | 🟡 | 4 min |
| 8. Impact | GPT-2, GPT-3, and how GPT-1’s paradigm took over AI | 🟢 | 4 min |
| 9. Summary | One-page recap | 🟢 | 2 min |

Also: Glossary · Quiz · Further Reading


GPT-1 architecture at a glance

 Input tokens (text + special markers)
        ↓
 Token Embedding + Positional Embedding
        ↓
 ┌───────────────────────────────────────┐
 │  Transformer Decoder Block × 12       │
 │                                       │
 │  Masked Multi-Head Self-Attention     │  ← causal: each token sees only past
 │  Feed-Forward Network                 │
 │  Layer Norm + Residual                │
 └───────────────────────────────────────┘
        ↓
 Linear layer → Softmax → P(next token)       [pre-training]
       OR
 Linear layer → Softmax → P(class label)      [fine-tuning]

The same 12-layer decoder handles both. No architecture changes between pre-training and fine-tuning — only the output head changes.
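The two ideas in the diagram, causal masking and a shared body with swappable output heads, can be sketched in NumPy. This is a toy illustration under stated assumptions, not GPT-1 itself: the sizes are tiny, random arrays stand in for attention scores and for the decoder's final hidden states, and `W_lm` and `W_clf` are hypothetical names for the two output projections.

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular boolean mask: position i may attend to positions 0..i."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_attention_weights(scores):
    """Row-wise softmax of `scores` after hiding all future positions."""
    n = scores.shape[0]
    masked = np.where(causal_mask(n), scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # stability shift
    weights = np.exp(masked)                              # exp(-inf) == 0
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_model, vocab_size, n_classes = 5, 8, 20, 2    # toy sizes

# Causal attention: each row only puts weight on itself and earlier tokens.
weights = masked_attention_weights(rng.normal(size=(n_tokens, n_tokens)))

# Stand-in for the output of the last decoder block.
hidden = rng.normal(size=(n_tokens, d_model))

W_lm = rng.normal(size=(d_model, vocab_size))    # pre-training head
W_clf = rng.normal(size=(d_model, n_classes))    # fine-tuning head

next_token_logits = hidden @ W_lm       # one next-token prediction per position
class_logits = hidden[-1] @ W_clf       # only the final token's state is used

print(np.round(weights, 2))
print(next_token_logits.shape, class_logits.shape)
```

Everything up to `hidden` is shared between the two phases; swapping `W_lm` for `W_clf` is the only structural difference, which is why fine-tuning needs so few new parameters.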

