3. The Idea — Masking, Bidirectional Encoding, and the [CLS] Token

BERT’s core insight is simple enough to state in one sentence: instead of predicting the next token, randomly hide some tokens and predict those.

Hiding tokens and asking the model to recover them is called Masked Language Modelling (MLM). It sidesteps the trivial-copying problem entirely: the model cannot see the answer, because the answer has been replaced with a special [MASK] token. To predict the original token, the model must use everything else in the sentence, both to the left and to the right. Masking is what makes bidirectional attention usable: every token can look in both directions, yet the prediction task never becomes trivially easy.

This is not a new idea in isolation. The Cloze test, invented by educational psychologist Wilson Taylor in 1953, works the same way: give a student a passage with words removed and ask them to fill in the blanks. Cloze tests measure reading comprehension because completing them requires understanding the whole passage, not just sentence fragments. BERT is, in essence, a neural network trained to score well on a very large Cloze test.

Masked Language Modelling (MLM)

During pre-training, BERT takes each input sequence and randomly selects 15% of its WordPiece tokens as prediction targets. For each selected token:

80% of the time: replace it with [MASK]
10% of the time: replace it with a random token from the vocabulary
10% of the time: leave it unchanged

Why not always use [MASK]? Because [MASK] is a special token that appears only during pre-training, never during fine-tuning. If the model only learned to make predictions at [MASK] positions, its representations of ordinary tokens would be poorly matched to what it sees during fine-tuning. By sometimes leaving the selected token unchanged (and still asking the model to predict it) and sometimes substituting a random wrong token, BERT removes that shortcut: the model cannot tell which positions it will be asked to predict, so it has to build useful representations for every token, not just the masked ones.
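The 80/10/10 rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the original pre-training code: the function name mask_tokens is hypothetical, each token is selected independently with probability 0.15 (which only approximates "15% of the tokens"), and [CLS] and [SEP] are assumed to be exempt from selection.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}  # assumption: special tokens are never selected


def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Return (corrupted, labels).

    labels[i] holds the original token at every selected position and
    None everywhere else, so the loss is computed only on selected tokens.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL or rng.random() >= select_prob:
            continue  # not selected: input unchanged, no prediction target
        labels[i] = tok  # selected: the model must predict the original
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: random vocabulary token
        # remaining 10%: leave the token as it is
    return corrupted, labels


toy_vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # hypothetical
sentence = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
corrupted, labels = mask_tokens(sentence, toy_vocab, rng=random.Random(0))
```

Note that a position left unchanged still carries a label. From the input alone, the model cannot distinguish it from a position that was never selected.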

Next Sentence Prediction (NSP)

Many important tasks, such as question answering, textual entailment, and dialogue, require understanding the relationship between two sentences, not just the words within a single sentence. MLM trains token-level understanding, but it does not directly train the model to judge how two sentences relate.

To address this, BERT adds a second pre-training task: Next Sentence Prediction.

For each training example, BERT receives two sentences, A and B, concatenated together. In 50% of cases, B is the actual sentence that follows A in the original text. In the remaining 50%, B is a randomly sampled sentence from the corpus that has nothing to do with A. BERT must classify whether B is the genuine next sentence (label: IsNext) or a random one (label: NotNext).
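The sampling scheme just described might be sketched as follows. The helper make_nsp_example is hypothetical, and two details are assumptions rather than things stated above: the corpus is represented as a list of documents (each a list of at least two sentences), and the random sentence B is drawn from a different document than A so that it cannot accidentally be the true continuation.

```python
import random


def make_nsp_example(documents, rng):
    """Build one (sentence_a, sentence_b, label) training triple.

    documents: list of documents, each a list of at least two sentences.
    At least two documents are needed so that a negative can be sampled.
    """
    doc_idx = rng.randrange(len(documents))
    doc = documents[doc_idx]
    i = rng.randrange(len(doc) - 1)  # A must have a successor
    sentence_a = doc[i]

    if rng.random() < 0.5:
        # Positive example: B really is the sentence that follows A.
        return sentence_a, doc[i + 1], "IsNext"

    # Negative example: B comes from some other document (assumption).
    other_docs = [j for j in range(len(documents)) if j != doc_idx]
    sentence_b = rng.choice(documents[rng.choice(other_docs)])
    return sentence_a, sentence_b, "NotNext"


corpus = [  # hypothetical toy corpus
    ["The cat sat on the mat.", "It was a comfortable mat."],
    ["Stocks fell on Monday.", "Analysts blamed rising rates.", "Bonds rallied."],
]
example = make_nsp_example(corpus, random.Random(0))
```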

The signal for this classification comes from the [CLS] token — a special token prepended to the very start of every input. Its final hidden state is a single vector that summarises the entire two-sentence input. A linear classifier on top of this vector produces the IsNext/NotNext prediction.
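To make "a linear classifier on top of this vector" concrete, here is a small pure-Python sketch. The function nsp_head, the random stand-in for the [CLS] hidden state, and the initial weight scale are all illustrative assumptions. The released implementation also passes the [CLS] vector through an extra dense layer with a tanh activation before classifying, which is omitted here for brevity.

```python
import math
import random


def nsp_head(cls_vector, weights, bias):
    """Linear layer plus softmax over the two classes (IsNext, NotNext).

    weights: two rows, each the same length as cls_vector.
    bias: two values, one per class.
    """
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


hidden_size = 768  # BERT-base hidden size
rng = random.Random(0)
# Stand-in for the encoder's final hidden state at the [CLS] position.
cls_vector = [rng.gauss(0.0, 1.0) for _ in range(hidden_size)]
weights = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)] for _ in range(2)]
bias = [0.0, 0.0]

p_is_next, p_not_next = nsp_head(cls_vector, weights, bias)
```

The classifier adds only 2 x hidden_size weights plus 2 biases, which is why the same pattern carries over cheaply to other classification tasks at fine-tuning time.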

After pre-training, the [CLS] vector becomes the standard way to do sentence-level or sentence-pair classification during fine-tuning — for tasks like sentiment analysis, textual entailment, and relevance ranking.

Special tokens: [CLS] and [SEP]

BERT uses two special tokens that appear in every input:

[CLS] (Classification token): Prepended to the very start of every input sequence. During pre-training, its final hidden state is used for the NSP binary classification. During fine-tuning, it is used as a fixed-size representation of the entire sequence for any classification task. You can think of it as the model’s “overall impression” of the input.

[SEP] (Separator token): Used to mark the boundary between sentence A and sentence B (in two-sentence tasks), and also appended at the end of every input. The model uses this to know where one sentence ends and another begins.

A BERT input for a two-sentence task looks like this:

[CLS]  The cat sat on the mat  [SEP]  It was a comfortable mat  [SEP]

And for a single-sentence task:

[CLS]  The food at this restaurant is excellent  [SEP]
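Assembling these sequences is mechanical. The sketch below uses a hypothetical helper, build_input, and splits on whitespace purely as a stand-in for real tokenisation.

```python
def build_input(sentence_a, sentence_b=None):
    """Wrap one or two sentences in BERT's special tokens.

    Whitespace splitting is only a placeholder for a real tokeniser.
    """
    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"]
    if sentence_b is not None:
        # The second sentence gets its own closing [SEP].
        tokens += sentence_b.split() + ["[SEP]"]
    return tokens


pair = build_input("The cat sat on the mat", "It was a comfortable mat")
single = build_input("The food at this restaurant is excellent")
print(" ".join(pair))
print(" ".join(single))
```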

Segment embeddings

Because BERT needs to know which tokens belong to sentence A and which to sentence B, it adds a segment embedding on top of the token embedding and positional embedding. Tokens from sentence A get segment embedding A; tokens from sentence B get segment embedding B. These segment embeddings are learned during pre-training.

The total embedding for each input token is therefore:

Input embedding = Token embedding + Positional embedding + Segment embedding

All three are learned. The Transformer encoder then processes the full sequence with full bidirectional attention — every token can attend to every other token with no directional restriction.
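The sum can be illustrated with three small lookup tables. Everything here is a toy sketch: the sizes are far smaller than BERT's, the tables are randomly initialised instead of learned, and the names (embed, segment_ids_for, the toy vocab) are hypothetical.

```python
import random

rng = random.Random(0)
HIDDEN = 8          # toy size; BERT-base uses 768
MAX_POSITIONS = 16  # toy size


def make_table(rows, cols):
    """Randomly initialised lookup table standing in for learned weights."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


vocab = {"[CLS]": 0, "[SEP]": 1, "the": 2, "cat": 3, "sat": 4, "it": 5, "purred": 6}
token_table = make_table(len(vocab), HIDDEN)
position_table = make_table(MAX_POSITIONS, HIDDEN)
segment_table = make_table(2, HIDDEN)  # row 0 = sentence A, row 1 = sentence B


def segment_ids_for(tokens):
    """0 up to and including the first [SEP], 1 for everything after it."""
    ids, segment = [], 0
    for tok in tokens:
        ids.append(segment)
        if tok == "[SEP]":
            segment = 1
    return ids


def embed(tokens):
    """Element-wise sum of token, positional, and segment embeddings."""
    vectors = []
    for pos, (tok, seg) in enumerate(zip(tokens, segment_ids_for(tokens))):
        t = token_table[vocab[tok]]
        p = position_table[pos]
        s = segment_table[seg]
        vectors.append([a + b + c for a, b, c in zip(t, p, s)])
    return vectors


tokens = ["[CLS]", "the", "cat", "sat", "[SEP]", "it", "purred", "[SEP]"]
vectors = embed(tokens)  # one HIDDEN-sized vector per input token
```

Because the three embeddings are added and not concatenated, the input to the encoder keeps the same width as a single embedding.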

WordPiece tokenisation

BERT does not tokenise text at the word level. It uses WordPiece, a subword tokenisation algorithm. Long or uncommon words are split into smaller pieces. For example (illustrative splits; the exact pieces depend on the vocabulary in use):

"unbelievable"  →  ["un", "##believable"]
"playing"       →  ["playing"]            (common enough to stay whole)
"Transformers"  →  ["Trans", "##form", "##ers"]

The ## prefix indicates a continuation piece (attached to the previous piece, no space). WordPiece has a fixed vocabulary: the released uncased English BERT-base vocabulary has 30,522 entries. This lets BERT handle rare words, misspellings, and words from multiple languages without an ever-growing vocabulary. It also means that even words the model has never seen as complete units can still be processed, because their component pieces are likely present in the vocabulary.
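At tokenisation time, WordPiece splits each word greedily, always taking the longest vocabulary entry that matches at the current position. The sketch below implements that longest-match-first rule for a single word. The tiny vocabulary is hypothetical and was chosen only to reproduce the example splits above; building a real vocabulary is a separate training step not shown here.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##believable", "playing", "play", "##ing",
             "Trans", "##form", "##ers"}

for word in ["unbelievable", "playing", "Transformers"]:
    print(word, "->", wordpiece(word, toy_vocab))
```

Because the longest match wins, "playing" stays whole even though "play" and "##ing" are also in the toy vocabulary.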

BERT-base and BERT-large

Devlin et al. released two model sizes:

Model       Transformer layers  Hidden size  Attention heads  Parameters
BERT-base   12                  768          12               110 million
BERT-large  24                  1024         16               340 million
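The parameter counts can be roughly reconstructed from the layer count and hidden size. The sketch below is a back-of-the-envelope estimate, not an official accounting. Beyond the table, it assumes a 30,522-entry vocabulary, 512 learned positions, a feed-forward inner size of four times the hidden size, layer normalisation after the embeddings and after each sub-layer, and one extra dense layer on the [CLS] output. The function name bert_param_count is hypothetical.

```python
def bert_param_count(layers, hidden, vocab_size=30522, max_positions=512,
                     segments=2, ffn_mult=4):
    """Estimate the number of encoder parameters (assumptions in the text)."""
    layer_norm = 2 * hidden  # one scale and one shift vector

    # Token, positional, and segment tables, then one layer norm.
    embeddings = (vocab_size + max_positions + segments) * hidden + layer_norm

    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)

    # Two dense layers: hidden -> inner and inner -> hidden, with biases.
    inner = ffn_mult * hidden
    feed_forward = (hidden * inner + inner) + (inner * hidden + hidden)

    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden  # dense layer on the [CLS] output
    return embeddings + layers * per_layer + pooler


base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"BERT-base:  {base / 1e6:.1f} million")
print(f"BERT-large: {large / 1e6:.1f} million")
```

Under these assumptions the estimates come out at about 109.5 million and 335 million, close to the rounded figures in the table. The number of attention heads does not appear in the formula, because splitting the hidden size across heads does not change the size of the projection matrices.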

BERT-base was sized to match GPT-1 (roughly the same number of parameters) so that the two models could be compared directly. BERT-large is about three times larger. Both were pre-trained on the same data: English Wikipedia (2.5 billion words) plus BooksCorpus (800 million words), for a total of roughly 3.3 billion words.

Pre-training BERT-large took 4 days on 64 TPU chips, a substantial compute budget for a research lab at the time. The resulting checkpoints were released publicly and became the foundation for hundreds of subsequent models.