Section 09

Summary: BERT in one page

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)


The one-sentence version

BERT is a Transformer encoder pre-trained to predict randomly masked words from their full bidirectional context — and that bidirectionality is what makes it dramatically better than GPT-1 at language understanding tasks.


The problem it solved

GPT-1 proved that pre-training works. But GPT-1 only read text left-to-right, which is fine for generation but discards half the available context for understanding. “The bank by the river” and “The bank charged fees” need rightward context to disambiguate the word “bank.” A left-to-right model cannot use it. BERT can.


The key ideas

Masked Language Modelling (MLM): Randomly select 15% of tokens and ask the model to predict them from all surrounding tokens, both left and right. Most selected tokens are replaced with [MASK] (in the paper: 80% become [MASK], 10% become a random token, 10% are left unchanged), so the model usually cannot copy the answer from the input and has to rely on context from both directions.
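The masking step can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function name `mask_tokens`, the toy vocabulary, and the per-token coin flip are assumptions made here (the paper selects 15% of positions per sequence and then applies an 80/10/10 replacement rule).

```python
import random

MASK = "[MASK]"
SPECIAL = ("[CLS]", "[SEP]")

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted, labels). labels[i] is None where no prediction is asked.

    Simplification: each token is selected independently with probability
    mask_prob, rather than selecting exactly 15% of positions.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL:              # never mask the special tokens
            continue
        if rng.random() >= mask_prob:   # token not selected for prediction
            continue
        labels[i] = tok                 # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return corrupted, labels

tokens = ["[CLS]", "the", "bank", "charged", "fees", "[SEP]"]
toy_vocab = ["river", "money", "tree"]  # hypothetical vocabulary
corrupted, labels = mask_tokens(tokens, toy_vocab, mask_prob=0.5)
print(corrupted)
print(labels)
```

The loss is computed only at positions where `labels` is not `None`; every other position contributes context but no training signal.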

Next Sentence Prediction (NSP): Classify whether sentence B genuinely follows sentence A, or is randomly sampled. Teaches sentence-level coherence. (Later shown to be less important than MLM.)
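A rough sketch of how an NSP training pair might be built. The helper `make_nsp_example` is hypothetical, and it draws the negative sentence from the same small list for brevity; the paper samples negatives from the wider corpus, with a 50/50 split between true and random next sentences.

```python
import random

def make_nsp_example(sentences, index, rng):
    """Build (sentence_a, sentence_b, is_next) from a list of consecutive sentences.

    Assumes len(sentences) >= 3 and index < len(sentences) - 1.
    """
    sentence_a = sentences[index]
    if rng.random() < 0.5:
        # Positive example: B really follows A.
        return sentence_a, sentences[index + 1], True
    # Negative example: B is any sentence other than A or its true successor.
    candidates = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
    return sentence_a, rng.choice(candidates), False

doc = [
    "She walked to the bank.",
    "The teller opened her account.",
    "Penguins live in Antarctica.",
    "The river was calm that day.",
]
print(make_nsp_example(doc, 0, random.Random(0)))
```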

[CLS] token: Prepended to every input. Its final hidden state is a vector summary of the entire sequence, used for classification tasks.

[SEP] token: Marks the boundary between sentence A and sentence B in two-sentence inputs.
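How [CLS] and [SEP] frame an input can be shown with a small sketch. The function name `build_input` is an assumption for illustration; the segment IDs (0 for sentence A, 1 for sentence B) follow the segment embeddings described in the paper.

```python
def build_input(tokens_a, tokens_b=None):
    """Wrap one or two token lists in BERT's special tokens.

    Returns (tokens, segment_ids), where segment_ids marks which sentence
    each position belongs to.
    """
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B plus its closing [SEP]
    return tokens, segment_ids

tokens, segments = build_input(["the", "bank"], ["it", "closed"])
print(tokens)    # ['[CLS]', 'the', 'bank', '[SEP]', 'it', 'closed', '[SEP]']
print(segments)  # [0, 0, 0, 0, 1, 1, 1]
```

For classification, the final hidden state at position 0 (the [CLS] slot) is the vector passed to the task-specific output layer.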

WordPiece tokenisation: Splits rare words into subword pieces, so words not seen during training can still be represented. The released uncased English models use a 30,522-token vocabulary.
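The subword splitting can be sketched as a greedy longest-match-first search over a vocabulary. This is a simplified illustration with a tiny made-up vocabulary; the `##` prefix marks a piece that continues a word, and the function name `wordpiece` is chosen here for the example.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                          # try a shorter candidate
        if piece is None:
            return [unk]                      # no piece fits: whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"play", "##ing", "##ed", "un", "##break", "##able"}  # hypothetical
print(wordpiece("playing", toy_vocab))      # ['play', '##ing']
print(wordpiece("unbreakable", toy_vocab))  # ['un', '##break', '##able']
print(wordpiece("xyz", toy_vocab))          # ['[UNK]']
```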


The architecture numbers

                   BERT-base                                 BERT-large
Layers             12                                        24
Hidden size        768                                       1024
Attention heads    12                                        16
Parameters         110M                                      340M
Training data      3.3B words (Wikipedia + BooksCorpus)      same
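The parameter counts can be roughly reproduced from the layer count and hidden size. The sketch below makes several assumptions not stated in the table: a 30,522-token vocabulary, 512 position embeddings, 2 segment embeddings, a feed-forward width of 4 times the hidden size, and a pooler layer on top of [CLS]; it leaves out the pre-training output heads.

```python
def bert_param_count(layers, hidden, vocab_size=30522, max_positions=512, segments=2):
    """Approximate encoder parameter count under the assumptions listed above."""
    ffn = 4 * hidden                 # feed-forward inner width (assumed 4x hidden)
    layer_norm = 2 * hidden          # scale + bias
    embeddings = (vocab_size + max_positions + segments) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)   # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler

print(bert_param_count(12, 768))    # 109482240, about 110M
print(bert_param_count(24, 1024))   # 335141888, close to the quoted 340M
```

The number of attention heads does not appear in the formula: the heads split the hidden dimension between them, so changing the head count leaves the parameter count unchanged.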

The GPT-1 vs BERT contrast

                          GPT-1                 BERT
Architecture              Decoder               Encoder
Direction                 Left-to-right         Bidirectional
Pre-training objective    Predict next token    Predict masked tokens + NSP
Can generate?             Yes                   No
Strength                  Generation            Understanding
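The "Direction" row comes down to which positions each token is allowed to attend to. A minimal sketch of the two attention masks, using plain Python lists (1 means "may attend", 0 means "blocked"); the function names are chosen here for illustration.

```python
def causal_mask(n):
    """Left-to-right (GPT-style): position i sees only positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """BERT-style: every position sees every other position."""
    return [[1] * n for _ in range(n)]

tokens = ["The", "bank", "charged", "fees"]
bank = tokens.index("bank")
print(causal_mask(len(tokens))[bank])         # [1, 1, 0, 0]
print(bidirectional_mask(len(tokens))[bank])  # [1, 1, 1, 1]
```

Under the causal mask, "bank" cannot see "charged fees", which is exactly the context needed to pick the financial sense of the word.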

The Indian analogy

A student studying with words randomly blacked out in the textbook, forced to guess each hidden word from both what came before and what came after. This forces bidirectional reading and deep understanding — not skimming. BERT’s pre-training is this process, applied to billions of sentences.


The results

  • GLUE (a suite of 9 language understanding tasks): 80.5, up from a previous best of 72.8
  • SQuAD 1.1: 93.2 F1, exceeding the published human score
  • SQuAD 2.0: 83.1 F1, a new state of the art
  • New state-of-the-art results on 11 NLP tasks, each obtained by fine-tuning the same pre-trained model design with only a small task-specific output layer

What came next

RoBERTa (more data, no NSP) → ALBERT (fewer parameters, comparable performance) → DistilBERT (40% smaller, 60% faster) → domain-specific BERTs (BioBERT, Legal-BERT) → T5 (encoder-decoder combining BERT and GPT ideas). BERT’s bidirectional pre-training approach remains widely used in production language understanding systems.

