Paper 11 — BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova · Google AI Language · 2018

What this paper did

It let the model read in both directions at once, and with that change BERT set new state-of-the-art results on eleven NLP tasks.

GPT-1 (Paper 10) proved that pre-training on unlabelled text transfers to downstream tasks. But GPT-1 read text only left-to-right: when predicting the next word, it could only look at the words that came before. This made it powerful for generation, but it meant every word’s representation was built on half the context — the past, never the future.

Devlin et al. asked a simple question: what if we let the model see both directions at once?

The answer was BERT — a Transformer encoder pre-trained with two objectives. The first, Masked Language Modelling (MLM), randomly covers 15% of tokens and asks the model to guess them from the surrounding words in both directions. The second, Next Sentence Prediction (NSP), asks the model to decide whether two sentences appear consecutively in text. Together, these objectives force the model to build deep, bidirectional representations of language.
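The masking step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: in the paper, each selected token is replaced with [MASK] 80% of the time, with a random token 10% of the time, and left unchanged 10% of the time. The function name `mask_tokens`, the toy vocabulary, and the choice to sample each position independently (instead of picking a fixed number of positions per sequence) are assumptions made for brevity.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted, labels) for one MLM training example.

    labels[i] holds the original token at positions the model must predict
    and None everywhere else. Hypothetical helper, for illustration only.
    """
    if rng is None:
        rng = random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Never select the special tokens; select others with prob. mask_prob.
        if tok in SPECIAL_TOKENS or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model is trained to recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return corrupted, labels


if __name__ == "__main__":
    sentence = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", ".", "[SEP]"]
    toy_vocab = ["the", "cat", "sat", "on", "mat", "dog", "."]
    print(mask_tokens(sentence, toy_vocab, rng=random.Random(0)))
```

Only positions with a non-None label contribute to the MLM loss; every other position is context.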

The result: fine-tuned BERT exceeded the reported human baseline on SQuAD 1.1 (reading comprehension), beat GPT-1 by large margins on GLUE (a suite of 9 NLP tasks), and set new records on named entity recognition and sentence inference, all by fine-tuning the same pre-trained model with a small task-specific output layer.

The key equations:

MLM objective:   L_MLM = −Σ log P(xᵢ | x₁,...,x_{i−1}, x_{i+1},...,xₙ)   [over masked positions]

NSP objective:   L_NSP = −Σ log P(y | [CLS] representation)   [over sentence pairs; y is the true label, IsNext or NotNext]

Total loss:      L = L_MLM + L_NSP

Here xᵢ is a masked token, and [CLS] is a special classification token prepended to every input; its final hidden state is used for sequence-level predictions.
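To make the two loss terms concrete, here is a small pure-Python sketch that computes them from raw logits. It is an illustration under stated assumptions: the function names are hypothetical, the label convention (0 = IsNext, 1 = NotNext) is chosen arbitrarily, and the MLM term is averaged over masked positions instead of summed, which changes the value only by a per-example scale factor.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target_index])


def bert_pretraining_loss(mlm_logits, mlm_targets, nsp_logits, nsp_label):
    """Combined loss L = L_MLM + L_NSP for a single example.

    mlm_logits:  one vocabulary-sized logit list per input position
    mlm_targets: correct token id at masked positions, None elsewhere
    nsp_logits:  two logits computed from the [CLS] hidden state
    nsp_label:   0 = IsNext, 1 = NotNext (assumed convention)
    """
    masked = [(l, t) for l, t in zip(mlm_logits, mlm_targets) if t is not None]
    # Only masked positions contribute to the MLM term.
    mlm_loss = sum(cross_entropy(l, t) for l, t in masked) / len(masked) if masked else 0.0
    nsp_loss = cross_entropy(nsp_logits, nsp_label)
    return mlm_loss + nsp_loss


if __name__ == "__main__":
    # Three positions, vocabulary of 4, only the middle position is masked.
    logits = [[0.0] * 4, [0.0] * 4, [0.0] * 4]
    print(bert_pretraining_loss(logits, [None, 2, None], [0.0, 0.0], 0))
```

With uniform logits the model is guessing, so the MLM term equals log 4 (one of four tokens) and the NSP term equals log 2 (one of two labels).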

The Indian analogy

Imagine a student studying for their Hindi exam from a textbook where the teacher has randomly blacked out words on each page. To figure out what the hidden word is, the student must read both what comes before and what comes after — they cannot rely on just the left side of the sentence.

This forces something powerful: the student stops skimming and starts understanding full sentences from both ends simultaneously.

BERT’s pre-training is exactly this. By masking random words and demanding the model recover them from full surrounding context, BERT is forced to build a representation of every word that incorporates the entire sentence — not just the words that preceded it. This bidirectional understanding is why BERT is dramatically better at comprehension tasks than GPT-1, which could only read left-to-right.

The second pre-training task — Next Sentence Prediction — is like asking the student: “Does paragraph B logically follow paragraph A, or was it taken from somewhere else?” Answering this requires understanding paragraph-level coherence, not just individual words.

The GPT-1 vs BERT divide

This is one of the most consequential architectural splits in modern NLP:

Property	GPT-1 (Paper 10)	BERT (Paper 11)
Architecture	Transformer decoder	Transformer encoder
Reading direction	Left-to-right (causal)	Bidirectional
Pre-training objective	Predict next token	Masked token prediction + NSP
Can generate text?	Yes	Not natively (no left-to-right decoding)
Strength	Generation, completion	Understanding, classification
Attention mask	Causal (future is blocked)	Full (all tokens see all tokens)

Neither is strictly better — they are optimised for different purposes. GPT became the foundation for generative AI. BERT became the foundation for search, question answering, and document understanding.
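The attention-mask row of the table is the easiest difference to see in code. The sketch below builds both masks as nested lists of booleans, where `mask[i][j]` being True means position i may attend to position j. The function names are hypothetical, and padding masks, which real implementations also apply, are left out.

```python
def causal_mask(n):
    """GPT-style mask: position i sees positions 0..i, never the future."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n):
    """BERT-style mask: every position sees every position."""
    return [[True] * n for _ in range(n)]


def render(mask):
    """Pretty-print a mask as rows of 1s and 0s."""
    return "\n".join(" ".join("1" if allowed else "0" for allowed in row) for row in mask)


if __name__ == "__main__":
    print("Causal (GPT-1):")
    print(render(causal_mask(4)))
    print("Full (BERT):")
    print(render(full_mask(4)))
```

The causal mask is lower-triangular, so the representation of the first token never depends on later tokens. The full mask has no zeros, which is what lets a [MASK] position draw on context from both sides.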

Read in this order

Section	What you will learn	Difficulty	Time
1. Context	NLP in late 2018 — the unidirectional limitation of GPT-1	🟢	4 min
2. The Problem	Why left-to-right context is insufficient for understanding	🟢	3 min
3. The Idea	Bidirectional encoders, MLM, NSP, and the [CLS]/[SEP] tokens	🟡	6 min
4. The Math	MLM loss, NSP loss, WordPiece tokenisation	🔴	10 min
5. Worked Example	Step-by-step forward pass through BERT-base on a real sentence	🔴	8 min
6. The Code	MLM with HuggingFace; classification with [CLS] token	🟡	7 min
7. Limitations	Cannot generate, NSP is weak, MLM mismatch, quadratic attention	🟡	4 min
8. Impact	RoBERTa, ALBERT, DistilBERT, and BERT’s legacy in search and NLP	🟢	4 min
9. Summary	One-page recap	🟢	2 min

Also: Glossary · Quiz · Further Reading

Before you read: math tutorials you need

Conditional Probability: MLM asks for P(masked token | all other tokens)
Cross-Entropy Loss: the MLM objective minimises cross-entropy over masked positions
Softmax Function: converts the output logits into token probabilities
Transformer (Paper 08): BERT uses the encoder stack from this paper
GPT-1 (Paper 10): helps understand the contrast between causal and bidirectional pre-training

BERT architecture at a glance

Input: [CLS] The cat sat on the [MASK] . [SEP]
              │
              ▼
    WordPiece Token Embeddings
  + Positional Embeddings
  + Segment Embeddings (sentence A or B)
              │
              ▼
  ┌──────────────────────────────────────┐
  │  Transformer Encoder Block × 12     │  ← BERT-base
  │                                     │  (× 24 for BERT-large)
  │  Multi-Head Self-Attention (full)   │  ← all tokens see all tokens
  │  Feed-Forward Network               │
  │  Layer Norm + Residual              │
  └──────────────────────────────────────┘
              │
              ▼
  ┌──────────────────────────────────────┐
  │  [CLS] hidden state → classifier    │  ← sentence-level tasks (NSP, sentiment)
  │  [MASK] hidden state → vocabulary   │  ← MLM: predict the masked token
  │  each token hidden state → label    │  ← token-level tasks (NER, QA)
  └──────────────────────────────────────┘
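The input side of the diagram can be sketched as follows: one or two token sequences are packed into a single input with [CLS] and [SEP], and each position receives a segment id (sentence A or B) and a position id. The model then looks up three embeddings per position and sums them. The helper name `pack_pair` is hypothetical, and the tokens are assumed to be already WordPiece-tokenised.

```python
def pack_pair(tokens_a, tokens_b=None):
    """Build BERT-style input for a single sentence or a sentence pair.

    Returns (tokens, segment_ids, position_ids). Segment id 0 marks
    [CLS], sentence A and its [SEP]; segment id 1 marks sentence B and
    its [SEP]. Hypothetical helper, for illustration only.
    """
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


if __name__ == "__main__":
    toks, segs, pos = pack_pair(["the", "cat", "sat"], ["it", "was", "tired"])
    for t, s, p in zip(toks, segs, pos):
        print(f"{p:>2}  segment={s}  {t}")
```

For a single-sentence task such as sentiment classification, `tokens_b` is omitted and every segment id is 0.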
