1. Context — NLP in 2018 and the labelled data bottleneck

By early 2018, the NLP community had made enormous progress on specific, well-defined tasks — but that progress had a hidden cost that was becoming hard to ignore.

The Transformer (2017) had largely settled the architecture question. Self-attention took the place of recurrence and produced markedly better results on translation, parsing, and text generation. The machinery was good. What was missing was the knowledge: the broad understanding of language that makes a system useful across many tasks rather than just one.

Here was the core problem: every NLP task required its own labelled dataset. Teaching a model to classify sentiment meant gathering thousands of movie reviews manually tagged as positive or negative. Teaching it to answer questions meant collecting question-answer pairs written by annotators. Teaching it to determine whether one sentence entails another (textual entailment) meant paying annotators to label sentence pairs. Each task had its own dataset, its own model trained from scratch, and its own narrow expertise that did not transfer anywhere else.

This labelled-data bottleneck created two serious problems.

First, it was expensive. Annotating data at the scale needed for good performance, from tens of thousands to millions of examples, required real money, time, and domain expertise. Smaller research groups, and companies in developing countries, often could not afford it.

Second, it was brittle. A sentiment model trained on English movie reviews learned nothing that would help it understand a legal contract. A question-answering model trained on Wikipedia knew nothing about product manuals. Every new task started from zero.

The contrast with human learning was striking. A person who reads extensively — novels, newspapers, science articles, government reports — builds a general understanding of language, logic, and the world that transfers effortlessly to new tasks. Ask them to classify a movie review and they do it immediately. Ask them to answer questions about a text they just read and they manage. They do not need millions of examples per task. They need a small number of examples to understand what is being asked, then they apply their pre-existing knowledge.

The key question of 2018: could a neural network acquire the same kind of general, transferable language understanding from unlabelled text? And if it did, how much would that general knowledge reduce the labelled data requirement for specific tasks?

The answer from Radford et al.: yes, and dramatically so — if you train the right model on enough text with the right objective.

There was an important line of prior work to understand. Word2Vec (2013) had shown that even simple neural models, trained to predict words from their context, learned meaningful vector representations of words that transferred across tasks. LSTM-based language models had extended this to sequences. ELMo (Peters et al., early 2018, just before GPT-1) showed that contextualised word representations, where the same word gets different embeddings depending on context, could improve many downstream tasks substantially.
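To make "predict words from their context" concrete, here is a minimal sketch of how unlabelled text can be turned into training examples without any human annotation. It is an illustration only, not Word2Vec's actual implementation: the function name `context_target_pairs` and the `window` parameter are hypothetical, and the real method adds an embedding model and training tricks that are omitted here.

```python
def context_target_pairs(tokens, window=2):
    """Turn a token list into (context words, target word) training examples."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after the target
        pairs.append((left + right, target))
    return pairs


# Raw, unlabelled text is the only input; the "labels" are the words themselves.
tokens = "the cat sat on the mat".split()
pairs = context_target_pairs(tokens, window=1)
for context, target in pairs:
    print(context, "->", target)
```

Every sentence in a corpus yields examples like these for free, which is why this style of training sidesteps the annotation cost described earlier.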

But all of these were still feature extractors. You trained them to get better word vectors, then plugged those vectors into task-specific models. The task-specific model still needed to be designed, trained, and tuned separately.

GPT-1’s contribution was to go further: train a complete model (not just word representations, but a full sequence processor capable of modelling long-range dependencies) and show that this complete model could be fine-tuned directly on a task with almost no architectural changes. The surprise was how much of what the model had learned carried over.
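The difference between using a pretrained model as a feature extractor and fine-tuning it can be shown with a deliberately tiny toy. This is a sketch under strong assumptions, not GPT-1's training procedure: the "encoder" and the task "head" are each a single hypothetical scalar weight, `PRETRAINED_ENCODER_W` stands in for weights learned from unlabelled text, and the task is a made-up regression.

```python
def train(encoder_w, head_w, data, update_encoder, lr=0.05, epochs=200):
    """Fit y ~ head_w * (encoder_w * x) by gradient descent on squared error.

    update_encoder=False mimics feature extraction: the pretrained part is frozen.
    update_encoder=True mimics fine-tuning: every weight adapts to the task.
    """
    for _ in range(epochs):
        for x, y in data:
            feature = encoder_w * x                 # output of the "pretrained" part
            error = head_w * feature - y            # prediction minus target
            grad_head = 2 * error * feature         # d(error^2) / d(head_w)
            grad_encoder = 2 * error * head_w * x   # d(error^2) / d(encoder_w)
            head_w -= lr * grad_head
            if update_encoder:
                encoder_w -= lr * grad_encoder
    return encoder_w, head_w


PRETRAINED_ENCODER_W = 1.0             # hypothetical stand-in for pretrained weights
task_data = [(0.5, 1.0), (1.0, 2.0)]   # toy labelled task: y = 2x

frozen_enc, frozen_head = train(PRETRAINED_ENCODER_W, 0.0, task_data, update_encoder=False)
tuned_enc, tuned_head = train(PRETRAINED_ENCODER_W, 0.0, task_data, update_encoder=True)

print("feature extraction:", frozen_enc, frozen_head)
print("fine-tuning:       ", tuned_enc, tuned_head)
```

In the frozen run only the task-specific head changes, while in the fine-tuned run the pretrained weight itself moves. In a toy this small both settings fit the data equally well, so the example shows only which parameters are updated, not the performance gap reported for real models.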

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever at OpenAI wrote the paper that established this paradigm. Their timing was no accident. The Transformer architecture had just matured. GPU memory had grown enough to train large models. BooksCorpus, a collection of over 7,000 unpublished books scraped from the internet, provided high-quality, diverse English text in large quantity. The pieces came together in 2018.