9. Summary

The one-paragraph version

GPT-1 (Radford et al., 2018) pre-trained a 12-layer Transformer decoder on roughly 800 million words of book text using a single objective: predict the next word. No labels are needed, because the text itself provides the supervision. After pre-training, the same model was fine-tuned on small labelled datasets for specific NLP tasks (sentiment, entailment, question answering) using a simple input transformation: wrap the task’s input in [START]…[EXTRACT] markers, with [DELIM] between segments when there is more than one. The Transformer itself is unchanged; fine-tuning adds only a linear output layer and the embeddings for the new marker tokens. The pre-trained model’s general language knowledge transfers to each task, and the fine-tuned model improved on the state of the art in 9 of the 12 tasks studied, beating architectures designed specifically for each task. This established the pre-train + fine-tune paradigm that most later large language models build on.
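The input transformation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `transform_input` is hypothetical, the markers are treated as plain strings rather than learned token embeddings, and no real tokenizer is involved.

```python
# Marker strings follow the summary above; in the real model each marker
# is a special token with its own learned embedding, not literal text.
START, DELIM, EXTRACT = "[START]", "[DELIM]", "[EXTRACT]"


def transform_input(*segments):
    """Join one or more text segments into a single marked-up sequence.

    Hypothetical helper: one segment for classification, two for
    entailment (premise, hypothesis), and so on.
    """
    if not segments:
        raise ValueError("need at least one text segment")
    body = f" {DELIM} ".join(segments)
    return f"{START} {body} {EXTRACT}"


# Single-segment task (e.g. sentiment classification).
classification = transform_input("The film was a delight.")

# Two-segment task (e.g. entailment: premise, then hypothesis).
entailment = transform_input("A man is sleeping.", "A person is awake.")

print(classification)
print(entailment)
```

The same function serves every task, which is the point of the design: the task is expressed by how the input is laid out, so the model underneath does not have to change.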

What changed vs. before GPT-1

Before GPT-1	After GPT-1
One model per task	One base model, many tasks
Train from scratch on labelled data	Pre-train on unlabelled, fine-tune on labelled
Task-specific architecture	One architecture + input transformation
Narrow, task-specific knowledge	General language understanding that transfers

The three key equations

Pre-training (maximise):
  L₁(U) = Σᵢ log P(uᵢ | uᵢ₋ₖ,...,uᵢ₋₁ ; Θ)

Fine-tuning (maximise):
  L₂(C) = Σ_(x,y) log P(y | x¹,...,xᵐ)   (sum over labelled examples (x, y) in C)

Combined (maximise):
  L₃(C) = L₂(C) + λ · L₁(C)   with λ = 0.5
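To show how the three objectives above fit together numerically, here is a small sketch. The probabilities are made up for illustration and the function names are hypothetical; in a real model these values would come from the network's softmax outputs.

```python
import math

LAMBDA = 0.5  # weight on the auxiliary language-model term, as in L3 above


def log_likelihood(probs):
    """Sum of log-probabilities assigned to the correct outcomes.

    Used for both L1 (correct next tokens) and L2 (correct labels).
    """
    return sum(math.log(p) for p in probs)


def combined_objective(label_probs, token_probs, lam=LAMBDA):
    """L3 = L2 + lambda * L1, computed on the same labelled dataset C."""
    return log_likelihood(label_probs) + lam * log_likelihood(token_probs)


# Made-up values: probability the model gave each actual next token ...
token_probs = [0.5, 0.25, 0.5]
# ... and the probability it gave the correct label of each example.
label_probs = [0.8, 0.5]

l1 = log_likelihood(token_probs)
l2 = log_likelihood(label_probs)
l3 = combined_objective(label_probs, token_probs)

print(f"L1 = {l1:.4f}, L2 = {l2:.4f}, L3 = {l3:.4f}")
```

All three quantities are log-likelihoods, so they are negative and training pushes them towards zero. With λ = 0.5 the language-model term acts as a regulariser: it counts, but only half as much as the task term.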

The Indian analogy (one line)

A student who reads everything for three years beats the one who crammed only the exam syllabus — after just one week of focused revision per subject.

What comes next

BERT (Paper 11), published four months later (October 2018, against GPT-1's June 2018), takes the bidirectional encoder approach: every token sees every other token. That is better for understanding tasks where the full input is given. The GPT vs. BERT debate shapes NLP for the next three years.

GPT-3 (Paper 12) — 175 billion parameters. Shows that large enough models can perform tasks from examples in the prompt, without any fine-tuning at all. Changes the usage model from “fine-tune per task” to “prompt engineering.”

RLHF / InstructGPT (Paper 15) — adds human feedback on top of the GPT paradigm, making models follow instructions rather than just predict text. The step that turns language models into assistants.

Concepts introduced in this paper

Autoregressive language model — predicts the next token from all previous tokens
Causal (masked) self-attention — each token attends only to past tokens
Pre-training — unsupervised learning on large unlabelled text
Fine-tuning — supervised adaptation of a pre-trained model
Input transformation — reshaping task inputs to match pre-training format
Combined loss — weighted sum of the task loss and the language-model loss
BooksCorpus — over 7,000 unpublished books used as GPT-1’s pre-training data
[START] / [DELIM] / [EXTRACT] — special tokens for task conditioning
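Of the concepts listed above, causal self-attention is the easiest to see in code. The sketch below uses plain Python lists and equal raw scores so the effect of the mask is obvious. The function names are hypothetical, and a real implementation would work on batched tensors with learned query, key, and value projections.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask_row):
    """Softmax over the allowed positions only; blocked positions get weight 0."""
    allowed = [s for s, ok in zip(scores, mask_row) if ok]
    peak = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


n = 4
mask = causal_mask(n)

# With equal raw scores, each token spreads its attention evenly
# over itself and the tokens before it, and gives none to the future.
weights = [masked_softmax([0.0] * n, row) for row in mask]

for row in weights:
    print([round(w, 2) for w in row])
```

Because no position can see what comes after it, the model can be trained to predict the next token at every position of a sequence in one pass, which is what makes the autoregressive objective practical.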
