Paper 11 — BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova · Google AI Language · 2018
What this paper did
It let the model read in both directions at once, and with that change a single pre-trained model set new state-of-the-art results on eleven language understanding tasks.
GPT-1 (Paper 10) proved that pre-training on unlabelled text transfers to downstream tasks. But GPT-1 read text only left-to-right: when predicting the next word, it could only look at the words that came before. This made it powerful for generation, but it meant every word’s representation was built on half the context — the past, never the future.
Devlin et al. asked a simple question: what if we let the model see both directions at once?
The answer was BERT, a Transformer encoder pre-trained with two objectives. The first, Masked Language Modelling (MLM), randomly selects 15% of the input tokens, hides most of them behind a [MASK] token, and asks the model to recover the originals from the surrounding words in both directions. The second, Next Sentence Prediction (NSP), asks the model to decide whether the second of two sentences actually follows the first in the source text. Together, these objectives push the model to build deep, bidirectional representations of language.
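The masking step can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' code: the function name `mask_tokens`, the tiny vocabulary, and the seed are made up for the example, and the 80/10/10 replacement split is taken from the original BERT paper rather than from the summary above.

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted, labels). labels[i] holds the original token at
    positions chosen for prediction and None everywhere else."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    # Special tokens are never chosen for prediction.
    candidates = [i for i, t in enumerate(tokens) if t not in (CLS, SEP)]
    n_select = min(len(candidates), max(1, round(mask_prob * len(candidates))))
    for i in rng.sample(candidates, n_select):
        labels[i] = tokens[i]
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return corrupted, labels

tokens = [CLS, "the", "cat", "sat", "on", "the", "mat", ".", SEP]
vocab = ["the", "cat", "sat", "on", "mat", "dog", "."]
corrupted, labels = mask_tokens(tokens, vocab)
```

The loss is computed only at positions where `labels` is not `None`; every other position serves purely as context.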
The result: fine-tuned BERT exceeded the human baseline on SQuAD 1.1 (reading comprehension), beat GPT-1 by large margins on GLUE (a suite of 9 NLP tasks), set a new record on SWAG (commonsense sentence inference), and was competitive with the best systems on named entity recognition, all starting from the same pre-trained checkpoint.
The key equations:
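MLM objective: L_MLM = −Σ_{i∈M} log P(xᵢ | x̃) [over masked positions]
NSP objective: L_NSP = −log P(y | h_[CLS]) [per sentence pair]
Total loss: L = L_MLM + L_NSP
Where M is the set of masked positions, xᵢ is the original token at masked position i, and x̃ is the corrupted input, so the model conditions on the visible tokens on both sides (x₁,...,x_{i−1} and x_{i+1},...,xₙ) but not on other masked tokens. y ∈ {IsNext, NotNext} is the true label for the sentence pair, and h_[CLS] is the final hidden state of [CLS], a special classification token prepended to every input and used for sequence-level predictions.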
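As a rough numerical illustration of the total loss, the sketch below computes both cross-entropy terms from raw logits using only the standard library. The helper names (`softmax`, `cross_entropy`, `pretraining_loss`) and the convention that index 0 means IsNext are assumptions made for this example.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    # Negative log-probability assigned to the correct class.
    return -math.log(softmax(logits)[target_index])

def pretraining_loss(mlm_logits, mlm_targets, nsp_logits, nsp_target):
    """mlm_logits: one list of vocabulary logits per masked position.
    mlm_targets: the true token id at each of those positions.
    nsp_logits: [logit_IsNext, logit_NotNext]; nsp_target: 0 or 1."""
    l_mlm = sum(cross_entropy(l, t) for l, t in zip(mlm_logits, mlm_targets))
    l_nsp = cross_entropy(nsp_logits, nsp_target)
    return l_mlm + l_nsp

# One masked position over a 4-token vocabulary, with uniform logits.
loss = pretraining_loss([[0.0, 0.0, 0.0, 0.0]], [2], [0.0, 0.0], 0)
```

With uniform logits the model is guessing, so the loss is log 4 for the masked token plus log 2 for the sentence pair.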
The Indian analogy
Imagine a student studying for their Hindi exam from a textbook where the teacher has randomly blacked out words on each page. To figure out what the hidden word is, the student must read both what comes before and what comes after — they cannot rely on just the left side of the sentence.
This forces something powerful: the student stops skimming and starts understanding full sentences from both ends simultaneously.
BERT’s pre-training is exactly this. By masking random words and demanding the model recover them from full surrounding context, BERT is forced to build a representation of every word that incorporates the entire sentence — not just the words that preceded it. This bidirectional understanding is why BERT is dramatically better at comprehension tasks than GPT-1, which could only read left-to-right.
The second pre-training task — Next Sentence Prediction — is like asking the student: “Does paragraph B logically follow paragraph A, or was it taken from somewhere else?” Answering this requires understanding paragraph-level coherence, not just individual words.
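A toy version of how NSP training pairs might be built is sketched below. The 50/50 split between true and random successors follows the original paper; the function name `make_nsp_pairs` is hypothetical, and drawing the random sentence from the same short list is a simplification (the paper samples it from elsewhere in the corpus).

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Pair each sentence with its true successor (IsNext) or with a
    different sentence (NotNext), each with probability 0.5.
    Assumes at least three distinct sentences."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        a = sentences[i]
        if rng.random() < 0.5:
            b, label = sentences[i + 1], "IsNext"
        else:
            # Any sentence except A itself and A's true successor.
            choices = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            b, label = rng.choice(choices), "NotNext"
        pairs.append((a, b, label))
    return pairs

sentences = [
    "The student opened the textbook.",
    "Several words on the page were blacked out.",
    "She read the full sentence to guess each one.",
    "The exam was the following week.",
]
pairs = make_nsp_pairs(sentences)
```

Each `(a, b, label)` triple becomes one input of the form `[CLS] a [SEP] b [SEP]`, with `label` as the target for the `[CLS]` classifier.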
The GPT-1 vs BERT divide
This is one of the most important architectural splits in modern NLP:
| Property | GPT-1 (Paper 10) | BERT (Paper 11) |
|---|---|---|
| Architecture | Transformer decoder | Transformer encoder |
| Reading direction | Left-to-right (causal) | Bidirectional |
| Pre-training objective | Predict next token | Masked token prediction + NSP |
| Can generate text? | Yes | No |
| Strength | Generation, completion | Understanding, classification |
| Attention mask | Causal (future is blocked) | Full (all tokens see all tokens) |
Neither is strictly better — they are optimised for different purposes. GPT became the foundation for generative AI. BERT became the foundation for search, question answering, and document understanding.
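The attention-mask row in the table above can be made concrete with a small sketch. Here a 1 means "query position i may attend to position j"; the function names are illustrative only.

```python
def causal_mask(n):
    # GPT-style: position i sees positions 0..i and nothing after it.
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def full_mask(n):
    # BERT-style: every position sees every other position.
    return [[1] * n for _ in range(n)]

gpt_view = causal_mask(4)
bert_view = full_mask(4)
```

The causal mask is lower-triangular, which is what allows left-to-right generation; the full mask has no zeros, which is what gives every token context from both sides.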
Read in this order
| Section | What you will learn | Difficulty | Time |
|---|---|---|---|
| 1. Context | NLP in late 2018 — the unidirectional limitation of GPT-1 | 🟢 | 4 min |
| 2. The Problem | Why left-to-right context is insufficient for understanding | 🟢 | 3 min |
| 3. The Idea | Bidirectional encoders, MLM, NSP, and the [CLS]/[SEP] tokens | 🟡 | 6 min |
| 4. The Math | MLM loss, NSP loss, WordPiece tokenisation | 🔴 | 10 min |
| 5. Worked Example | Step-by-step forward pass through BERT-base on a real sentence | 🔴 | 8 min |
| 6. The Code | MLM with HuggingFace; classification with [CLS] token | 🟡 | 7 min |
| 7. Limitations | Cannot generate, NSP is weak, MLM mismatch, quadratic attention | 🟡 | 4 min |
| 8. Impact | RoBERTa, ALBERT, DistilBERT, and BERT’s legacy in search and NLP | 🟢 | 4 min |
| 9. Summary | One-page recap | 🟢 | 2 min |
Also: Glossary · Quiz · Further Reading
Before you read: math tutorials you need
- Conditional Probability: MLM asks for P(masked token | all other tokens)
- Cross-Entropy Loss: the MLM objective minimises cross-entropy over masked positions
- Softmax Function: converts the output logits into token probabilities
- Transformer (Paper 08): BERT uses the encoder stack from this paper
- GPT-1 (Paper 10): helps in understanding the contrast between causal and bidirectional pre-training
BERT architecture at a glance
Input: [CLS] The cat sat on the [MASK] . [SEP]
│
▼
WordPiece Token Embeddings
+ Positional Embeddings
+ Segment Embeddings (sentence A or B)
│
▼
┌──────────────────────────────────────┐
│ Transformer Encoder Block × 12 │ ← BERT-base
│ │ (× 24 for BERT-large)
│ Multi-Head Self-Attention (full) │ ← all tokens see all tokens
│ Feed-Forward Network │
│ Layer Norm + Residual │
└──────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ [CLS] hidden state → classifier │ ← sentence-level tasks (NSP, sentiment)
│ [MASK] hidden state → vocabulary │ ← MLM: predict the masked token
│ each token hidden state → label │ ← token-level tasks (NER, QA)
└──────────────────────────────────────┘
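The input stage of the diagram can be sketched as follows: wrap the sentences in special tokens, assign segment and position ids, and sum three embedding lookups per token. The function names and the tiny two-dimensional embedding tables are made up for illustration; real BERT-base uses 768-dimensional learned embeddings.

```python
def build_input(tokens_a, tokens_b=None):
    """Return (tokens, segment_ids, position_ids) for one or two sentences."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)              # sentence A
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B and its [SEP]
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

def embed(token_ids, segment_ids, position_ids, tok_emb, seg_emb, pos_emb):
    # Each input vector is the element-wise sum of three table lookups.
    return [
        [t + s + p for t, s, p in zip(tok_emb[ti], seg_emb[si], pos_emb[pi])]
        for ti, si, pi in zip(token_ids, segment_ids, position_ids)
    ]

tokens, segment_ids, position_ids = build_input(["the", "cat"], ["it", "sat"])

# Toy lookup tables with two rows each, just to show the summation.
tok_emb = [[1, 2], [3, 4]]
seg_emb = [[10, 20], [30, 40]]
pos_emb = [[100, 200], [300, 400]]
vectors = embed([1, 0], [0, 1], [0, 1], tok_emb, seg_emb, pos_emb)
```

The summed vectors are what enter the first encoder block; the three output heads in the diagram then read from the final hidden states at `[CLS]`, at masked positions, or at every token.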