3. The Idea — pre-train then fine-tune

GPT-1’s approach can be stated simply:

Pre-train a large Transformer decoder on massive unlabelled text using next-word prediction
Fine-tune the same model on small labelled datasets — with almost no changes to the architecture

The magic is in the word “same.” The model that predicts the next word in a novel is also, after fine-tuning, the model that classifies sentiment, determines textual entailment, and answers questions. No separate models. No task-specific architectures. One base model, many applications.

Step 1: Pre-training — learning from unlabelled text

The pre-training objective is deceptively simple: given the previous k words (the context window), predict the next word.

If you are training on the sentence “Raju went to the market to buy vegetables,” the model sees “Raju went to the market to buy” and must predict “vegetables.” Then it sees “Raju went to the market to buy vegetables and” and must predict the next word (perhaps “came” or “forgot”). This repeats for every position in every sentence in the training corpus.
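The way one sentence turns into many training examples can be sketched in a few lines. This is a minimal illustration, not GPT-1's actual data pipeline: the function name `next_word_examples` is hypothetical, and it splits on whitespace, whereas the real model works on subword tokens.

```python
def next_word_examples(sentence, k):
    """Return (context, target) pairs: up to k previous words -> next word."""
    words = sentence.split()  # assumption: whitespace words, not subword tokens
    examples = []
    for i in range(1, len(words)):
        context = words[max(0, i - k):i]  # at most k words before position i
        examples.append((context, words[i]))
    return examples


examples = next_word_examples("Raju went to the market to buy vegetables", k=7)
for context, target in examples:
    print(" ".join(context), "->", target)
```

Every position after the first yields one example, so an eight-word sentence gives seven prediction problems with no labelling effort.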

This objective:

Requires no labels. The supervision is the text itself.
Scales with data. More text generally means better predictions, and predicting well requires a better model of the language.
Forces the model to learn grammar (a verb usually follows a subject), facts (a market is where you buy things), and even some reasoning (if Raju “went to” somewhere, he probably comes back later).

GPT-1 was trained on BooksCorpus: approximately 7,000 self-published books scraped from the web, totalling around 800 million words. Books were chosen deliberately. They contain long stretches of contiguous text with long-range dependencies (a character introduced on page 1 reappears on page 200), which pushes the model to learn to track information over long distances.

The model architecture: 12 Transformer decoder layers, 768-dimensional hidden states, 12 attention heads. Masked self-attention means each token can only attend to previous tokens (not future ones), which enforces the causal structure of next-word prediction. Total parameters: ~117 million.
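The ~117 million figure can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below takes the layer count and width from the text above. The vocabulary size, context length, feed-forward width, and the choice to share the output layer with the token embedding are assumptions (values commonly reported for the released model), not stated in this section.

```python
N_LAYERS = 12
D_MODEL = 768
VOCAB_SIZE = 40_478    # assumption: subword vocabulary of the released model
N_POSITIONS = 512      # assumption: learned position embeddings, 512 positions
D_FF = 4 * D_MODEL     # assumption: feed-forward inner width of 3072


def count_parameters():
    token_emb = VOCAB_SIZE * D_MODEL          # also reused as the output layer (assumed tied)
    pos_emb = N_POSITIONS * D_MODEL
    # Query, key, value and output projections, each with a bias vector.
    attn = 4 * (D_MODEL * D_MODEL + D_MODEL)
    # Two feed-forward linear layers with biases.
    ffn = (D_MODEL * D_FF + D_FF) + (D_FF * D_MODEL + D_MODEL)
    # Two layer norms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * D_MODEL)
    per_layer = attn + ffn + layer_norms
    return token_emb + pos_emb + N_LAYERS * per_layer


total = count_parameters()
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the total lands just under 117 million, with roughly a quarter of the parameters sitting in the token embedding table.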

Step 2: Fine-tuning — adapting to specific tasks

After pre-training, the model has a rich internal representation of language. Fine-tuning converts this general representation into task-specific behaviour.

For a classification task (e.g., sentiment analysis):

Add a linear layer on top of the final transformer block’s output
Feed labelled examples through the model
Train with cross-entropy loss on the correct label
Keep the learning rate small: you are adjusting the weights, not relearning from scratch

Crucially: the transformer weights are updated during fine-tuning. They are not frozen. The pre-trained knowledge is adjusted, not discarded. This is different from just using pre-trained vectors as fixed features.
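The classification head itself is tiny. The sketch below shows only the added linear layer, the softmax, and the cross-entropy loss, in plain Python. The vector `hidden` stands in for the final transformer block's output, and all numbers and names are hypothetical. In real fine-tuning the gradient of this loss also flows back into the transformer weights, which this sketch does not model.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(hidden, weights, bias):
    """Linear layer on the final hidden state, followed by softmax."""
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


def cross_entropy(probs, label):
    """Negative log-probability assigned to the correct label."""
    return -math.log(probs[label])


hidden = [0.5, -1.0, 2.0]                       # hypothetical final-block output
weights = [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.0]]   # 2 labels x 3 hidden dimensions
bias = [0.0, 0.0]

probs = classify(hidden, weights, bias)
loss = cross_entropy(probs, label=0)
print(probs, loss)
```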

The input transformation trick — no architecture changes

Here is the most elegant part of GPT-1’s design. Different NLP tasks have different input structures:

Classification: one piece of text → one label
Entailment: two sentences (premise, hypothesis) → entails / contradicts / neutral
Similarity: two sentences → similar or not similar (1 or 0)
Multiple choice: question + several answer options → pick the correct answer

A naïve approach would design a different model for each input structure. GPT-1 does the opposite: it transforms all input structures into a single linear token sequence that the pre-trained model already knows how to process.

Special tokens are inserted as separators and markers:

Classification:
  [START] text [EXTRACT] → linear → P(label)

Entailment:
  [START] premise [DELIM] hypothesis [EXTRACT] → linear → P(entails/neutral/contradicts)

Similarity (symmetric):
  [START] text₁ [DELIM] text₂ [EXTRACT]  \
                                           element-wise sum → linear → P(similar)
  [START] text₂ [DELIM] text₁ [EXTRACT]  /

Multiple Choice (one pass per answer):
  [START] context [DELIM] answer_i [EXTRACT] → linear → score_i
  Final answer = argmax over scores

The [START], [DELIM], and [EXTRACT] tokens are added to the vocabulary during fine-tuning. Their embeddings start out random, and the model learns what they mean from the labelled data. Together with the linear output layer, they are the only new parameters. The 12-layer transformer itself is unchanged: it still processes sequences of tokens and produces context-aware representations.
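The four transformations above amount to a little list manipulation. A minimal sketch, assuming text arrives as a list of word strings. The function names are hypothetical and the special tokens are plain strings here rather than vocabulary ids.

```python
START, DELIM, EXTRACT = "[START]", "[DELIM]", "[EXTRACT]"


def classification_input(text):
    return [START, *text, EXTRACT]


def pair_input(first, second):
    """Shared layout for any task with two text segments."""
    return [START, *first, DELIM, *second, EXTRACT]


def entailment_input(premise, hypothesis):
    return pair_input(premise, hypothesis)


def similarity_inputs(text_a, text_b):
    # No natural ordering, so build both; the model runs once per ordering.
    return [pair_input(text_a, text_b), pair_input(text_b, text_a)]


def multiple_choice_inputs(context, answers):
    # One sequence (and one forward pass) per candidate answer.
    return [pair_input(context, answer) for answer in answers]


print(classification_input("a great film".split()))
print(entailment_input("it is raining".split(), "the road is wet".split()))
print(similarity_inputs("hello there".split(), "hi".split()))
print(multiple_choice_inputs("capital of India ?".split(), [["Delhi"], ["Mumbai"]]))
```

Every task ends up as the same kind of object, a flat token list, which is why the pre-trained network needs no structural changes.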

The combined training objective

During fine-tuning, the authors found that keeping the language modelling objective active alongside the task-specific objective improved results. The intuition: if you only train on the task loss, the model might forget its general language knowledge. The language modelling term acts as a regulariser.

L_total = L_task  +  λ · L_language_model

Here λ is a weighting coefficient, set to 0.5. The task loss drives performance. The language modelling loss keeps the representations general.
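As a worked instance of the formula, the sketch below combines two loss values. The function name and the example numbers are hypothetical. In practice both losses would be computed on the same fine-tuning batch.

```python
def combined_loss(task_loss, lm_loss, lam=0.5):
    """L_total = L_task + lam * L_language_model."""
    return task_loss + lam * lm_loss


total_loss = combined_loss(task_loss=1.0, lm_loss=3.0)
print(total_loss)  # 1.0 + 0.5 * 3.0 = 2.5
```

Setting `lam=0.0` recovers plain task-only fine-tuning, which makes the auxiliary term easy to switch off for comparison.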

The Indian analogy (extended)

Think of a student who spent three years reading everything — fiction, science, history, current affairs. They developed a deep, general understanding of how language works, how arguments are structured, and how events relate to causes and consequences.

Now the Board exams approach.

Fine-tuning is the sprint before each exam. The student reads the specific syllabus, does past papers, gets feedback. But they are not learning language from scratch — they are applying a general understanding they already have to a specific format.

The input transformation trick is like the student quickly reading the exam instructions before answering: “Oh, this is a comprehension question — I need to read both paragraphs and find the relationship. I know how to do that.” They adapt their existing knowledge to the task format, rather than building a new skill from scratch for each exam.

The student who read nothing for three years and only studied the syllabus might learn to pass one specific exam. But the broadly read student, after the same one-week sprint, performs better, because they bring much more background knowledge.

GPT-1 is the broadly read student.

Why decoder-only, not encoder-decoder?

GPT-1 uses only the decoder half of the Transformer (from Paper 08). This means masked self-attention — each token can only see tokens before it, not after.

For next-word prediction, this is exactly right. You predict word t using words 1 through t−1. Seeing word t+1 or later would be cheating — the model would just copy the answer.
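The "no peeking ahead" rule can be made concrete with a small sketch of causal masking. It assumes raw attention scores are already available as a square matrix (row t holds token t's scores against every position) and it leaves out the query, key and value projections entirely. The function name is hypothetical.

```python
import math


def causal_attention_weights(scores):
    """Softmax each row over positions 0..t only; later positions get weight 0."""
    n = len(scores)
    weights = []
    for t in range(n):
        visible = scores[t][: t + 1]          # token t sees itself and earlier tokens
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (n - t - 1)
        weights.append(row)
    return weights


scores = [[0.0] * 4 for _ in range(4)]        # hypothetical uniform raw scores
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

With uniform scores, each row spreads its attention evenly over the positions it is allowed to see, and everything to the right of the diagonal stays at zero.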

BERT (Paper 11) will make the opposite choice: use the encoder (bidirectional attention) and a masked word prediction objective. This lets every token's representation draw on context from both the left and the right. The trade-off: BERT is not designed to generate text autoregressively, while GPT-1 is. This architectural choice set the direction for the generative GPT models that followed.