Section 09

Summary

Improving Language Understanding by Generative Pre-Training 2018

The one-paragraph version

GPT-1 (Radford et al., 2018) pre-trained a 12-layer Transformer decoder on 800 million words of book text using a single objective: predict the next token. No labels are needed, because the text itself provides the supervision. After pre-training, the same model was fine-tuned on small labelled datasets for specific NLP tasks (sentiment, entailment, question answering) using a simple input transformation: wrap the task’s input in [START]…[EXTRACT] markers, with [DELIM] between parts if there are multiple segments. The architecture stays the same apart from a linear output layer added for the task. The pre-trained model’s general language knowledge transfers to each task, making it competitive with, and on 9 of the 12 tasks studied better than, models built around task-specific architectures. This established the pre-train + fine-tune paradigm that underpins most major language models in use today.
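The input transformation described above can be sketched in a few lines. This is an illustration only: the marker strings and the `transform` helper are hypothetical names chosen here, and the real model works with learned special-token embeddings rather than literal bracketed strings.

```python
# Hypothetical marker strings standing in for GPT-1's learned special tokens.
START, DELIM, EXTRACT = "[START]", "[DELIM]", "[EXTRACT]"


def transform(*segments):
    """Wrap one or more text segments into a single input sequence.

    One segment  -> [START] text [EXTRACT]            (e.g. sentiment)
    Two segments -> [START] a [DELIM] b [EXTRACT]     (e.g. entailment)
    """
    if not segments:
        raise ValueError("need at least one text segment")
    return f"{START} " + f" {DELIM} ".join(segments) + f" {EXTRACT}"


# Single-segment task: classification of one sentence.
print(transform("the film was wonderful"))

# Two-segment task: premise and hypothesis for entailment.
print(transform("A man is sleeping.", "A person is awake."))
```

Because every task is reduced to one token sequence, the same pre-trained network can read all of them; only the small output layer on top differs per task.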


What changed vs. before GPT-1

Before GPT-1                          | After GPT-1
One model per task                    | One base model, many tasks
Train from scratch on labelled data   | Pre-train on unlabelled, fine-tune on labelled
Task-specific architecture            | One architecture + input transformation
Narrow, task-specific knowledge       | General language understanding that transfers

The three key equations

Pre-training (maximise):
  L₁(U) = Σᵢ log P(uᵢ | uᵢ₋ₖ,...,uᵢ₋₁ ; Θ)

Fine-tuning (maximise, summing over the labelled examples (x, y) in C):
  L₂(C) = Σ log P(y | x¹,...,xᵐ)

Combined (maximise):
  L₃(C) = L₂(C) + λ · L₁(C)   with λ = 0.5
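To make the arithmetic concrete, here is a minimal sketch of the three objectives on toy numbers. The probabilities are made up for illustration and the function names are hypothetical; in the real model these probabilities come from the Transformer's softmax outputs.

```python
import math


def lm_log_likelihood(token_probs):
    """L1: sum of log-probabilities assigned to each actual next token."""
    return sum(math.log(p) for p in token_probs)


def task_log_likelihood(label_probs):
    """L2: sum of log-probabilities assigned to each correct label."""
    return sum(math.log(p) for p in label_probs)


def combined_objective(label_probs, token_probs, lam=0.5):
    """L3 = L2 + lambda * L1, with both terms computed on the labelled set."""
    return task_log_likelihood(label_probs) + lam * lm_log_likelihood(token_probs)


# Made-up values: probability the model gave to the true next token / true label.
token_probs = [0.5, 0.25, 0.5]
label_probs = [0.5, 0.25]

print("L1 =", lm_log_likelihood(token_probs))
print("L2 =", task_log_likelihood(label_probs))
print("L3 =", combined_objective(label_probs, token_probs))
```

All three quantities are log-likelihoods, so they are negative and training pushes them towards zero. Keeping the language-model term during fine-tuning acts as an auxiliary objective alongside the task loss.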

The Indian analogy (one line)

A student who reads everything for three years beats the one who crammed only the exam syllabus — after just one week of focused revision per subject.


What comes next

BERT (Paper 11), published about four months later, takes the bidirectional encoder approach: every token sees every other token. This works better for understanding tasks where the full input is given. The GPT vs. BERT debate shapes NLP for the next three years.

GPT-3 (Paper 12) — 175 billion parameters. Shows that large enough models can perform tasks from examples in the prompt, without any fine-tuning at all. Changes the usage model from “fine-tune per task” to “prompt engineering.”

RLHF / InstructGPT (Paper 15) — adds human feedback on top of the GPT paradigm, making models follow instructions rather than just predict text. The step that turns language models into assistants.


Concepts introduced in this paper

  • Autoregressive language model — predicts the next token from all previous tokens
  • Causal (masked) self-attention — each token attends only to past tokens
  • Pre-training — unsupervised learning on large unlabelled text
  • Fine-tuning — supervised adaptation of a pre-trained model
  • Input transformation — reshaping task inputs to match pre-training format
  • Combined loss — task loss + language model loss weighted sum
  • BooksCorpus — 7,000 unpublished books used as GPT-1’s training data
  • [START] / [DELIM] / [EXTRACT] — special tokens for task conditioning
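
Of the concepts listed above, causal self-attention is the easiest to show in code. The sketch below is a simplification in plain Python, assuming a square matrix of raw attention scores for a single head; `causal_attention_weights` is a hypothetical helper name, and real implementations do this with batched tensor operations.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], keeping only positions j <= i.

    Position i may attend to itself and to earlier positions; later
    positions get exactly zero weight.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]              # the past and the current token
        top = max(visible)                  # subtract max for numerical stability
        exps = [math.exp(s - top) for s in visible]
        total = sum(exps)
        hidden = [0.0] * (len(row) - i - 1)  # future tokens are masked out
        weights.append([e / total for e in exps] + hidden)
    return weights


# Toy scores: all equal, so each token spreads attention evenly over its past.
scores = [[0.0] * 4 for _ in range(4)]
for row in causal_attention_weights(scores):
    print([round(w, 3) for w in row])
```

The triangular pattern of non-zero weights is what lets the model be trained on next-token prediction: no position can look at the token it is being asked to predict.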
