6. The Code — MLM with HuggingFace and Classification with [CLS]
Runs free on Google Colab. Install:
pip install transformers torch
Two code blocks: (1) fill-mask, to see BERT predict masked tokens; (2) sentence classification using the [CLS] vector.
Code Block 1: Fill-Mask — BERT predicts masked tokens
# Install once: pip install transformers torch
from transformers import BertTokenizer, BertForMaskedLM
import torch
# Load BERT-base-uncased (110M parameters, pre-trained checkpoint)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval() # evaluation mode — no dropout
# Input: sentence with [MASK] token
sentence = "The cat sat on the [MASK]."
inputs = tokenizer(sentence, return_tensors="pt") # pt = PyTorch tensors
# Find which position has the [MASK] token
mask_idx = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(f"[MASK] is at position: {mask_idx.item()}") # Expected: 6
# Forward pass — no gradients needed for inference
with torch.no_grad():
    outputs = model(**inputs) # logits shape: (1, seq_len, vocab_size)
# Get logits for the masked position
logits_at_mask = outputs.logits[0, mask_idx, :] # shape: (1, 30522)
probs = torch.softmax(logits_at_mask, dim=-1) # convert to probabilities
# Print top-5 predictions
top5 = torch.topk(probs, 5)
print("\nTop-5 predictions for [MASK]:")
for score, token_id in zip(top5.values[0], top5.indices[0]):
    word = tokenizer.decode([token_id.item()]) # .item() turns the tensor into a plain int
    print(f" '{word}': {score.item():.4f}")
# Illustrative output (exact words and scores will vary):
# 'floor': 0.2341
# 'mat': 0.1823
# 'ground': 0.1102
# 'table': 0.0876
# 'shelf': 0.0512
Try changing the sentence to see how context affects predictions. Replace “cat” with “politician” and compare the predictions for [MASK]: they should shift, because BERT reads the full sentence in both directions.
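The softmax and top-k step from Code Block 1 can be looked at in isolation. The sketch below uses only the Python standard library; the five-word vocabulary and the logits are made up for illustration and do not come from BERT, whose real vocabulary has 30,522 entries.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, vocab, k):
    """Return the k most probable (word, probability) pairs, best first."""
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical vocabulary and logits for a single masked position
toy_vocab = ["floor", "mat", "ground", "table", "banana"]
toy_logits = [2.0, 1.5, 1.0, 0.5, -3.0]

toy_probs = softmax(toy_logits)
for word, p in top_k(toy_probs, toy_vocab, k=3):
    print(f"{word}: {p:.4f}")
```

Softmax preserves the ordering of the logits, so the top-k words are the same whether you rank logits or probabilities; the conversion only makes the scores easier to read.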
Code Block 2: Sentence classification using the [CLS] vector
from transformers import BertTokenizer, BertModel
import torch
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased") # base model, no task head
model.eval()
# Two sentences to compare
sentence_a = "The movie was absolutely fantastic — I loved every scene."
sentence_b = "The food at the canteen was cold and tasteless."
# Tokenise both sentences
inputs = tokenizer(
    [sentence_a, sentence_b],
    padding=True, # pad shorter sentence to match longer
    return_tensors="pt"
)
with torch.no_grad():
    # outputs.last_hidden_state shape: (batch_size, seq_len, hidden_size)
    # outputs.pooler_output shape: (batch_size, hidden_size); this is the final
    # [CLS] hidden state passed through BERT's pooler (a dense layer + tanh)
    outputs = model(**inputs)
# Extract the pooled [CLS] vector for each sentence
# (the raw [CLS] hidden state is outputs.last_hidden_state[:, 0, :])
cls_vectors = outputs.pooler_output # shape: (2, 768)
print(f"CLS vector shape: {cls_vectors.shape}") # torch.Size([2, 768])
# Compute cosine similarity between the two CLS vectors
# (measures how similar the model thinks the two sentences are)
a = cls_vectors[0] # sentence A
b = cls_vectors[1] # sentence B
similarity = torch.dot(a, b) / (torch.norm(a) * torch.norm(b))
print(f"\nCosine similarity between A and B: {similarity.item():.4f}")
# Unrelated sentences should score lower than near-paraphrases, but raw pre-trained
# BERT vectors often give high values for most pairs, so do not expect a score near 0
# In a real classifier, you would:
# 1. Load a labelled dataset (e.g. sentiment: positive/negative)
# 2. Add a linear layer on top of cls_vectors: nn.Linear(768, num_classes)
# 3. Fine-tune the whole model on that dataset
# 4. At test time, pass a sentence → get cls_vector → linear layer → class prediction
print("\nThe CLS vector is a 768-dim fingerprint of the sentence.")
print("Fine-tune a linear layer on top to classify any sentence property.")
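To make the cosine-similarity line in Code Block 2 easier to interpret, the sketch below checks the same formula on tiny hand-written vectors and compares it with PyTorch's built-in `torch.nn.functional.cosine_similarity`. The three-dimensional vectors and the helper name `manual_cosine` are illustrative assumptions, not BERT outputs.

```python
import torch
import torch.nn.functional as F

def manual_cosine(a, b):
    """Same formula as in Code Block 2: dot product over the product of norms."""
    return torch.dot(a, b) / (torch.norm(a) * torch.norm(b))

# Hand-written toy pairs covering the full range of the measure
pairs = {
    "same direction": (torch.tensor([1.0, 2.0, 3.0]), torch.tensor([2.0, 4.0, 6.0])),
    "orthogonal": (torch.tensor([1.0, 0.0, 0.0]), torch.tensor([0.0, 1.0, 0.0])),
    "opposite": (torch.tensor([1.0, 2.0, 3.0]), torch.tensor([-1.0, -2.0, -3.0])),
}

results = {}
for name, (a, b) in pairs.items():
    manual = manual_cosine(a, b).item()
    builtin = F.cosine_similarity(a, b, dim=0).item()
    results[name] = (manual, builtin)
    print(f"{name}: manual={manual:.4f}, built-in={builtin:.4f}")
```

Cosine similarity depends only on direction, not length, so it ranges from -1 (opposite) through 0 (orthogonal) to 1 (same direction).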
What just happened?
In Code Block 1, BERT ran a full bidirectional forward pass over your sentence, using context from both sides of the [MASK] to score every word in its vocabulary. Words such as “mat” or “floor” tend to rank high because phrases like “sat on the mat” are common in the text BERT was pre-trained on.
In Code Block 2, you extracted the pooled [CLS] vector: the 768-dimensional summary that BERT builds for the entire sentence (pooler_output is the final [CLS] hidden state passed through a dense layer and tanh). This is what gets plugged into a linear classifier for fine-tuning. The pre-trained vector already carries some sentence-level information, but it is only a rough sentence embedding until the model is fine-tuned, so do not read too much into raw cosine similarities.
The critical point: in a fine-tuning scenario, you would train the linear classifier and simultaneously update all of BERT’s parameters on your labelled dataset. The pre-trained parameters give the model a massive head start — you typically need only a few hundred to a few thousand labelled examples, not millions.
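The classifier head described above can be sketched in a few lines. This is a simplified illustration, not the full fine-tuning recipe: random tensors stand in for BERT's pooled [CLS] vectors, the labels are made up, and only the linear head is trained. In real fine-tuning the vectors would come from BERT and BERT's own parameters would be updated too.

```python
import torch
from torch import nn

torch.manual_seed(0)  # reproducible stand-in data

hidden_size, num_classes, batch_size = 768, 2, 8

# Stand-in for pooled [CLS] vectors: random numbers, NOT real BERT outputs
fake_cls_vectors = torch.randn(batch_size, hidden_size)
labels = torch.tensor([0, 1, 0, 1, 0, 1, 0, 1])  # made-up class labels

head = nn.Linear(hidden_size, num_classes)  # the classification head
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(head.parameters(), lr=0.001)

with torch.no_grad():
    initial_loss = loss_fn(head(fake_cls_vectors), labels).item()

# A few plain gradient-descent steps on the head only
for _ in range(200):
    optimizer.zero_grad()
    logits = head(fake_cls_vectors)  # shape: (batch_size, num_classes)
    loss = loss_fn(logits, labels)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    final_logits = head(fake_cls_vectors)
    final_loss = loss_fn(final_logits, labels).item()
    predictions = final_logits.argmax(dim=-1)  # predicted class per sentence

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
print(f"predicted classes: {predictions.tolist()}")
```

To move from this sketch to actual fine-tuning, you would replace `fake_cls_vectors` with the model's output inside the training loop and pass both the head's and BERT's parameters to the optimizer, usually with a much smaller learning rate.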