Context: The State of NLP in 2020
By early 2020, the deep learning revolution in NLP was in full swing. But the field had settled into a pattern: fine-tuning.
The Fine-Tuning Paradigm
Here’s how NLP worked in 2019–2020:
- Pre-training: Train a large language model (like BERT, GPT-1, RoBERTa, ALBERT) on large amounts of unlabeled text, such as books, Wikipedia, and web pages. This takes days to weeks on a cluster of GPUs or TPUs and costs thousands of dollars or more in compute.
- Fine-tuning: For each new task (sentiment analysis, question answering, named-entity recognition, machine translation), download the pre-trained model and train it further on labeled data specific to that task. Every task needs its own labeled dataset and its own training run.
- Inference: Use the fine-tuned model on new examples.
This worked well. Models like BERT set new benchmarks on dozens of natural language understanding tasks. But it had two problems:
Problem 1: Cost and Data Hunger
Fine-tuning requires labeled training data. For a small company or an independent researcher, labeling 1,000–10,000 examples is expensive. If you have 50 different tasks (customer support topics, content classification, intent detection), you need 50 separate labeled datasets and 50 separate models. This does not scale.
Problem 2: The Task-Specific Model Tax
Once fine-tuned, each model is locked to a single task. A sentiment classifier trained on movie reviews often degrades on product reviews unless it is retrained. An English→Hindi translation model cannot translate Hindi→English; that direction needs its own model. You end up deploying dozens or even hundreds of models, each a frozen specialization.
What GPT-1 Showed
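In June 2018, OpenAI released GPT-1 (the original Generative Pre-trained Transformer). It was a 117-million-parameter decoder-only model pre-trained on BooksCorpus, a collection of several thousand unpublished books. The paper's main results came from fine-tuning, but it also included an intriguing analysis of zero-shot behavior: even without fine-tuning, the pre-trained model showed nontrivial performance on several NLP tasks, and that performance improved steadily over the course of pre-training. This hinted at something few people were fully exploring: what if you could do tasks without fine-tuning at all?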
But GPT-1 was small. Its zero-shot performance was weaker than fine-tuned BERT. The field largely ignored the signal and continued with fine-tuning.
Scale Was Everywhere Else
By 2020, scale was already reshaping machine learning:
- Computer vision: Training sets grew well beyond the original ImageNet benchmark, and architectures such as ResNet, and later Vision Transformers, were scaled up.
- Speech: Wav2Vec and similar models showed that larger models trained on more unlabeled audio improved downstream tasks.
- Recommendation systems: Scaling embedding dimensions and hidden layers improved ranking quality.
But in NLP, a common view was that beyond 1–2 billion parameters, fine-tuning rather than model size becomes the bottleneck. Many researchers assumed a model of 10 or 20 billion parameters would show diminishing returns. The field was pessimistic about scale.
The Technical Setup: Transformers
By 2020, the Transformer architecture (from “Attention Is All You Need,” Vaswani et al., 2017) was mature:
- Self-attention: Each token attends to other tokens in the sequence. In decoder-only models, it attends only to itself and the tokens before it.
- Scaled dot-product attention: Compute similarity between queries and keys, then weight the values.
- Multi-head attention: Run multiple attention operations in parallel, then combine.
- Feedforward networks: After attention, apply two fully-connected layers with a nonlinearity (ReLU or GELU).
- Layer normalization and residual connections: Stabilize training.
The decoder-only variant (used in GPT-1) made sense for language modeling: predict the next token given all previous tokens. This is causal language modeling (you only look backward).
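The attention pieces listed above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not GPT-1's actual implementation: the function name `causal_self_attention`, the toy dimensions, and the random weights are all assumptions made for the example, and multi-head attention, the feedforward layers, layer normalization, and residual connections are left out.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention (illustrative sketch).

    x: (seq_len, d_model) token vectors.
    w_q, w_k, w_v: (d_model, d_head) projection matrices.
    Returns the attended values and the attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_head = q.shape[-1]
    # Scaled dot-product: similarity between every query and every key.
    scores = (q @ k.T) / np.sqrt(d_head)
    # Causal mask: position i may attend only to positions j <= i.
    seq_len = x.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax turns scores into weights that sum to 1.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy sizes chosen only for the demonstration.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = causal_self_attention(x, w_q, w_k, w_v)
print(out.shape)             # (5, 4)
print(np.round(weights, 2))  # upper triangle is all zeros
```

The printed weight matrix is lower-triangular: the first token can only attend to itself, and each later token spreads its weight over itself and the tokens before it. That is the "only look backward" property that makes next-token prediction a valid training objective.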
Why Scale Hadn’t Been Tried
Training a 175-billion-parameter model requires:
- Massive compute: At the time, only a handful of labs had the GPU cluster capacity. Outside estimates put the compute cost of a single training run in the millions of dollars.
- Massive data: You need hundreds of billions of tokens. OpenAI drew its training mix from filtered Common Crawl, WebText2, two book corpora, and Wikipedia, and trained on about 300 billion tokens sampled from that mix.
- Uncertainty: Would the gains be worth it? Nobody had trained at this scale before. The ROI was unclear.
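To get a feel for the compute requirement, here is a back-of-the-envelope calculation. It relies on a widely used rule of thumb that is not stated in this section: training a dense Transformer takes roughly 6 FLOPs per parameter per training token. The function name `estimate_training_flops` is made up for the example, and the result is an order-of-magnitude estimate, not a measured figure.

```python
def estimate_training_flops(n_params, n_tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per token
    (forward plus backward pass). This is an approximation."""
    return 6 * n_params * n_tokens

n_params = 175e9   # 175 billion parameters
n_tokens = 300e9   # 300 billion training tokens

train_flops = estimate_training_flops(n_params, n_tokens)

# One petaflop/s-day = 1e15 FLOP/s sustained for 24 hours.
PETAFLOP_S_DAY = 1e15 * 24 * 3600
pf_days = train_flops / PETAFLOP_S_DAY

print(f"{train_flops:.2e} FLOPs")          # 3.15e+23 FLOPs
print(f"{pf_days:.0f} petaflop/s-days")    # 3646 petaflop/s-days
```

In other words, hardware sustaining one petaflop per second would need on the order of ten years to finish, which is why a run like this was only practical on a very large cluster.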
OpenAI had the resources, the data pipeline, and, crucially, the hypothesis: scale is the thing. Jared Kaplan, Tom Henighan, and their colleagues on the scaling work had been running experiments on smaller models that suggested smooth power-law improvements in loss as you scale up. Why not just go big?
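What "smooth power-law improvements" means in practice can be shown with a small sketch. The data here is synthetic: the constants `ALPHA` and `N_C` are illustrative values chosen for the example, not fitted values from any real experiment, and the functional form L(N) = (N_c / N)^alpha is assumed as the kind of relationship the scaling experiments suggested. The point is the method: a power law is a straight line in log-log space, so you can fit it on small models and extrapolate to a large one.

```python
import numpy as np

# Illustrative constants only (assumptions, not measured values).
ALPHA = 0.08
N_C = 1e14

def power_law_loss(n_params, alpha=ALPHA, n_c=N_C):
    """Synthetic loss that follows L(N) = (N_c / N) ** alpha."""
    return (n_c / n_params) ** alpha

# Pretend these are losses measured on four small models.
sizes = np.array([1e6, 1e7, 1e8, 1e9])
losses = power_law_loss(sizes)

# log L = alpha * log N_c - alpha * log N, so fit a line in log-log space.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)

def extrapolate(n_params):
    """Predict the loss at a model size outside the fitted range."""
    return float(np.exp(intercept + slope * np.log(n_params)))

print(f"fitted exponent: {-slope:.3f}")   # fitted exponent: 0.080
print(f"extrapolated loss at 175e9 params: {extrapolate(175e9):.3f}")
```

Real measurements are noisy and the fit is never this clean, but the logic is the same: if the line holds over several orders of magnitude on small models, it is a reasonable bet that it keeps holding for a while longer.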
The Moment
By mid-2020, the conditions were right:
- Transformers were proven and well-understood.
- The fine-tuning paradigm was showing its limits.
- GPT-1 hinted that language models have a hidden ability: performing tasks without fine-tuning, which the GPT-3 paper would study as in-context learning.
- Scaling experiments on smaller models showed steady improvements.
- OpenAI had compute and data.
This is where GPT-3 enters. It asked a simple question: What happens if you scale a language model to 175 billion parameters and train it on 300 billion tokens?
The answer rewrote the field.
Key Takeaways from This Section
- Pre-training + fine-tuning was the standard in NLP by 2020, but required labeled data and produced task-specific models.
- Scale was transforming other domains (vision, speech), but language modeling had not yet been pushed to the hundred-billion-parameter range.
- GPT-1 hinted that language models could perform tasks without fine-tuning, but the hint was mostly ignored.
- Only well-resourced labs could afford to train at billion-parameter scales.
- The question was ripe: Would scaling to 175B parameters unlock new capabilities?