The Problem: task-specific models cannot share knowledge — Improving Language Understanding by Generative Pre-Training

To understand why GPT-1 mattered, you need to feel the frustration of the approach it replaced.

The status quo: one model per task

In 2017–2018, the standard NLP workflow looked like this:

  Task A (Sentiment)       Task B (Entailment)            Task C (QA)
          │                         │                          │
   Labelled data A           Labelled data B            Labelled data C
          │                         │                          │
    Model A (LSTM)      Model B (BiLSTM+Attention)   Model C (Transformer)
          │                         │                          │
   Only works on A           Only works on B            Only works on C

Each task had its own pipeline, and the models shared nothing: no parameters, no training data, no learned representations. Knowledge about how negation works (relevant to sentiment) never helped the entailment model. Understanding that “John hit Mary” and “Mary was hit by John” mean the same thing (relevant to question answering) never informed the sentiment model.

This is not how human understanding works. Humans have one general language faculty that applies across tasks.

The cost of starting from scratch

Consider building a sentiment classifier for a new Indian language — say, Marathi or Tamil. The labelled data problem multiplies:

You need thousands of Marathi sentences annotated as positive or negative
You need annotators who speak Marathi fluently, and they are hard to hire at scale
You need a separate model for Marathi, another for Tamil, another for Bengali
None of those models benefits from what the others learned

Even for English, maintaining separate models for 10 different tasks means 10 separate training runs, 10 separate hyperparameter searches, 10 sets of potential bugs, and 10 models to deploy and monitor.

The deeper problem: what are models actually learning?

When a sentiment model trained on movie reviews sees the sentence “The cinematography was breathtaking but the plot was predictable,” it correctly identifies this as mixed sentiment — but how? Has it learned something general about language, or has it memorised patterns specific to movie reviews?

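If you then ask that model whether “All birds can fly” entails “Penguins can fly,” it fails completely. Its output layer only knows sentiment labels, and nothing it picked up from movie reviews covers logical relations between sentences. The failure is not because the task is harder in principle, but because the model was never exposed to that kind of reasoning during training.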

The model learned a very narrow slice of what language understanding means.

The unlabelled data sitting unused

Here is the uncomfortable fact about 2018: raw text was available in practically unlimited supply. Wikipedia, books, news articles, academic papers, and forum discussions added up to billions of words of English text, freely available, capturing an enormous range of human knowledge, reasoning, and expression.

None of this text had task-specific labels. A Wikipedia article about the Indian Rebellion of 1857 did not come with entailment labels. A novel by Premchand did not come with sentiment tags. So, apart from its use in training word vectors, this data was largely ignored by task-specific models, which needed the right kind of labels to learn anything.

The waste was enormous: a vast ocean of human knowledge, sitting unused because the community had not figured out how to learn from unlabelled text in a way that transferred to labelled tasks.
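
To make the missed opportunity concrete: raw text carries its own training signal, because every word can serve as the prediction target for the words that precede it. The sketch below is an illustration under stated assumptions, not GPT-1's actual data pipeline. It uses naive whitespace tokenisation, and `next_token_pairs` and `context_size` are hypothetical names chosen for this example.

```python
from typing import Iterator, List, Tuple


def next_token_pairs(text: str, context_size: int = 3) -> Iterator[Tuple[List[str], str]]:
    """Yield (context, target) pairs from raw text, using no human labels."""
    tokens = text.split()  # assumption: naive whitespace tokenisation
    for i in range(1, len(tokens)):
        # The context is up to `context_size` tokens immediately before position i.
        context = tokens[max(0, i - context_size):i]
        yield context, tokens[i]


sentence = "The rebellion began in Meerut in 1857"
pairs = list(next_token_pairs(sentence, context_size=3))
for context, target in pairs:
    print(context, "->", target)
```

Every pair here comes from the text alone; no annotator was involved. Language-model pre-training builds on this kind of self-generated target.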

The research question GPT-1 answered

Can you:

Learn from unlabelled text at scale — without any task-specific labels
Acquire general-purpose language representations — not just word vectors, but deep reasoning capacity
Transfer that knowledge to specific tasks — with only a small amount of fine-tuning on labelled data
Beat models trained from scratch on those tasks — even though those models were purpose-built

The answer, it turned out, was yes to all four: the paper reports improving on the state of the art in 9 of the 12 tasks it studied.
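
The recipe implied by these four questions can be sketched structurally as one shared body plus a small head per task. This is a toy illustration only: `SharedEncoder` and `TaskHead` are hypothetical names, the encoder simply averages word embeddings rather than running a Transformer, and all weights are random rather than pre-trained or fine-tuned.

```python
import random
from typing import List

rng = random.Random(0)


def init_matrix(rows: int, cols: int) -> List[List[float]]:
    """Small random weights; a stand-in for learned parameters."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


class SharedEncoder:
    """Stand-in for a body that would be pre-trained once on unlabelled text."""

    def __init__(self, vocab: List[str], dim: int = 8):
        self.index = {word: i for i, word in enumerate(vocab)}
        self.embeddings = init_matrix(len(vocab), dim)
        self.dim = dim

    def encode(self, text: str) -> List[float]:
        # Average the embeddings of the known words (toy feature extractor).
        rows = [self.embeddings[self.index[w]] for w in text.lower().split() if w in self.index]
        if not rows:
            return [0.0] * self.dim
        return [sum(col) / len(rows) for col in zip(*rows)]


class TaskHead:
    """Small task-specific linear layer on top of the shared features."""

    def __init__(self, dim: int, labels: List[str]):
        self.labels = labels
        self.weights = init_matrix(len(labels), dim)

    def predict(self, features: List[float]) -> str:
        scores = [sum(w * f for w, f in zip(row, features)) for row in self.weights]
        return self.labels[scores.index(max(scores))]


vocab = "the plot was predictable all birds can fly penguins".split()
encoder = SharedEncoder(vocab)  # one body, reused by every task
heads = {
    "sentiment": TaskHead(encoder.dim, ["negative", "positive"]),
    "entailment": TaskHead(encoder.dim, ["entailment", "contradiction", "neutral"]),
}

features = encoder.encode("the plot was predictable")
shared_params = len(vocab) * encoder.dim
for task, head in heads.items():
    head_params = len(head.labels) * encoder.dim
    # Weights are untrained, so the predicted label is arbitrary.
    print(task, "->", head.predict(features),
          f"({head_params} task-specific vs {shared_params} shared parameters)")
```

The point of the structure is that both heads read the same features: whatever the shared body has learned is available to every task, and each new task adds only a small number of parameters of its own.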
