7. Limitations — What BERT Cannot Do and Where It Falls Short

BERT was a landmark achievement, but it has real limitations — some structural, some empirical, some that led directly to the papers that followed.

BERT cannot generate text

This is a hard architectural constraint, not a tuning issue.

BERT is an encoder. Its self-attention is fully bidirectional: every token can attend to every other token. This is ideal for understanding a given input, but it rules out text generation in the standard autoregressive sense. To generate text, a model must produce tokens one at a time, conditioning each new token only on the tokens already generated. BERT was never trained to do this. During pre-training, every prediction it makes has access to context on both sides of the target position. At generation time the right-hand context does not exist yet, so the model would be operating under conditions it never saw in training.
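The difference between the two attention patterns can be sketched as a visibility matrix. This is a minimal illustration in plain Python, not code from either model; the function name attention_mask is an assumption introduced here.

```python
def attention_mask(n, causal):
    """Return an n x n matrix; mask[i][j] == 1 means position i may attend to position j."""
    return [[1 if (not causal or j <= i) else 0 for j in range(n)]
            for i in range(n)]

# BERT-style: every position sees every other position.
bidirectional_mask = attention_mask(4, causal=False)

# GPT-style: position i sees only positions 0..i.
causal_mask = attention_mask(4, causal=True)

for name, mask in (("bidirectional", bidirectional_mask), ("causal", causal_mask)):
    print(name)
    for row in mask:
        print(" ", row)
```

In the causal matrix, row i depends only on positions up to i, which is why a model trained under that mask can extend a prefix one token at a time. The bidirectional matrix has no such property.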

GPT-1’s causal (left-to-right) masking makes generation natural. BERT’s bidirectionality makes it poorly suited to generation. This is the fundamental trade-off: BERT became the dominant model for understanding tasks; GPT became the dominant model for generation tasks. The two lineages diverged here, although later encoder-decoder models such as BART and T5 paired a bidirectional encoder with an autoregressive decoder.

The [MASK] token does not appear during fine-tuning (pre-training/fine-tuning mismatch)

During pre-training, BERT sees [MASK] tokens constantly. During fine-tuning and inference, [MASK] never appears — you feed the model ordinary text without any masking. This creates a mismatch: the model was optimised partly to handle [MASK] tokens, but it never encounters them in deployment.

BERT’s authors partially addressed this in how they corrupt the 15% of tokens selected for prediction: 80% of the selected tokens are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged. Because the model must also predict tokens at positions that look ordinary, it is forced to build useful representations even when no [MASK] token is present. But the mismatch is not fully eliminated. Subsequent models explored different pre-training objectives that avoid it, for example XLNet (permutation language modelling) and ELECTRA (replaced-token detection).
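The 80/10/10 corruption rule can be sketched as follows. This is a simplified illustration, not BERT's actual data pipeline: it assumes each token is selected independently with probability 0.15 (the original code selects a fixed number of positions per sequence), and the names mlm_corrupt, select_prob and vocab are introduced here for illustration only.

```python
import random

MASK = "[MASK]"

def mlm_corrupt(tokens, vocab, select_prob=0.15, rng=random):
    """Return (corrupted_tokens, labels).

    labels[i] holds the original token where a prediction is required,
    and None where the position contributes no loss.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue              # not selected: no prediction target here
        labels[i] = tok           # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return corrupted, labels

sentence = "the cat sat on the mat and looked at the dog".split()
toy_vocab = ["tree", "ran", "blue", "house"]  # toy vocabulary for the example
corrupted, labels = mlm_corrupt(sentence, toy_vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

At fine-tuning time the input looks like the "leave unchanged" case everywhere, which is the only branch of this procedure that resembles deployment.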

MLM only trains on 15% of tokens per sequence

Because only 15% of tokens are selected for prediction in each sequence, each example provides a direct training signal at only 15% of its positions. The other 85% contribute no loss of their own; they matter only as context for the predicted positions. This makes BERT’s pre-training less sample-efficient per token than GPT-1’s next-token prediction, which computes a loss at every position in every sequence.

In practice, BERT compensates with more training steps and more data. All else being equal, a masked language model needs more passes over the data than a causal language model to receive the same number of prediction targets, which raises the cost of training a BERT from scratch.
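A rough count makes the gap concrete. The sketch below assumes a 512-token sequence and ignores special tokens and padding; the function name positions_with_loss is hypothetical.

```python
def positions_with_loss(seq_len, mlm_fraction=0.15):
    """Return (mlm_targets, causal_targets) for one sequence of seq_len tokens."""
    mlm_targets = round(seq_len * mlm_fraction)  # only the selected positions
    causal_targets = seq_len - 1                 # every token except the first has a predecessor
    return mlm_targets, causal_targets

mlm, causal = positions_with_loss(512)
print(mlm, causal)  # 77 511
```

Under these assumptions a causal language model gets between six and seven times as many prediction targets from the same sequence. This counts targets only; it says nothing about how informative each target is.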

NSP is a weak task and may not help

Later research, particularly RoBERTa (Liu et al., 2019), showed that the Next Sentence Prediction objective may not help and can even hurt. In RoBERTa's ablations, removing the NSP loss and training with MLM alone matched or slightly improved downstream performance. The full RoBERTa model, which also used more data, larger batches and longer training, exceeded BERT-large on most benchmarks, so not all of that gain can be attributed to dropping NSP. Still, the result suggests that the NSP task, as formulated, does not teach meaningful cross-sentence reasoning. The random negative pairs (sentence B drawn from a completely different document) may be too easy: a model can distinguish IsNext from NotNext using superficial topic cues rather than genuine coherence understanding.
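How the negative pairs are built explains why they may be too easy. The following is a minimal sketch of NSP example construction under the 50/50 scheme, not the original preprocessing code; make_nsp_example and the toy documents are assumptions made for illustration.

```python
import random

def make_nsp_example(docs, rng):
    """Build one (sentence_a, sentence_b, label) triple.

    docs: at least two documents, each a list of at least two sentences.
    """
    doc_idx = rng.randrange(len(docs))
    doc = docs[doc_idx]
    i = rng.randrange(len(doc) - 1)      # leave room for a following sentence
    sentence_a = doc[i]
    if rng.random() < 0.5:
        # Positive: the sentence that actually follows A in the same document.
        return sentence_a, doc[i + 1], "IsNext"
    # Negative: a random sentence from a *different* document.
    other_idx = rng.choice([k for k in range(len(docs)) if k != doc_idx])
    return sentence_a, rng.choice(docs[other_idx]), "NotNext"

toy_docs = [
    ["The court reviewed the contract.", "It found the clause unenforceable."],
    ["Mix the flour and water.", "Let the dough rest for an hour."],
]
print(make_nsp_example(toy_docs, random.Random(0)))
```

In the toy documents a negative pair always mixes a legal sentence with a cooking sentence, so topic alone separates the two labels. That is the shortcut the paragraph above describes.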

Quadratic attention cost limits sequence length

Like any Transformer that uses full self-attention, BERT has a compute and memory cost that grows as O(n²) in sequence length. In addition, both BERT-base and BERT-large were trained with learned position embeddings for at most 512 positions, so 512 tokens is a hard input limit. Many documents, such as legal contracts, research papers and books, are far longer than 512 tokens. BERT must either truncate them (losing information) or process them in chunks with no cross-chunk attention (losing long-range coherence).

The quadratic cost is a property of full self-attention, not something specific to BERT. Subsequent models like Longformer and BigBird introduced sparse attention patterns whose cost grows roughly linearly with sequence length, which allows much longer inputs. But standard BERT cannot handle long documents without truncation or chunking.
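The chunking workaround and its cost can be sketched in a few lines. This is an illustration under simplifying assumptions: chunks do not overlap, the special tokens that count toward the 512 limit in practice are ignored, and the helper names chunk and attention_pairs are introduced here.

```python
def chunk(token_ids, max_len=512):
    """Split a token sequence into consecutive, non-overlapping chunks."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]

def attention_pairs(n):
    """Number of query-key pairs scored by full self-attention over n tokens."""
    return n * n

document = list(range(4096))      # stand-in for 4096 token ids
chunks = chunk(document)

full = attention_pairs(len(document))
chunked = sum(attention_pairs(len(c)) for c in chunks)

print(len(chunks))   # 8
print(full)          # 16777216
print(chunked)       # 2097152
```

Chunking cuts the number of scored pairs by a factor of eight in this example, but every pair that was dropped is a pair of tokens in different chunks. Those are exactly the long-range interactions the model can no longer see.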

Fine-tuning can be unstable on small datasets

On datasets with fewer than a few thousand examples, fine-tuning BERT can be brittle: results vary significantly across random seeds and learning rates. The large number of parameters (110M for BERT-base) relative to a small fine-tuning dataset leaves the outcome sensitive to initialisation of the task head and to data ordering. Getting reliable results requires running several fine-tuning seeds and reporting the mean and spread, which adds compute cost.
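A seed sweep of this kind can be organised as below. This is a minimal harness sketch: train_and_eval is a hypothetical callable that fine-tunes with a given seed and returns a validation score, and placeholder_fine_tune only simulates scores so that the example runs. Its numbers are not real results.

```python
import random
import statistics

def evaluate_over_seeds(train_and_eval, seeds):
    """Run train_and_eval(seed) for each seed and summarise the scores."""
    scores = [train_and_eval(seed) for seed in seeds]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation; needs >= 2 seeds
        "min": min(scores),
        "max": max(scores),
    }

def placeholder_fine_tune(seed):
    """Placeholder only: returns a simulated score, NOT a real fine-tuning result."""
    return 0.80 + random.Random(seed).uniform(-0.05, 0.05)

summary = evaluate_over_seeds(placeholder_fine_tune, seeds=[0, 1, 2, 3, 4])
print(f"mean={summary['mean']:.3f} stdev={summary['stdev']:.3f} "
      f"range=[{summary['min']:.3f}, {summary['max']:.3f}]")
```

Reporting the minimum and maximum alongside the mean makes a single failed run visible instead of letting it disappear into the average.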

Summary of limitations

BERT cannot generate text (encoder only), has a pre-training/fine-tuning mismatch due to [MASK], receives a training signal from only 15% of tokens per sequence, gains little or nothing from NSP, cannot accept sequences longer than 512 tokens, and can be unstable when fine-tuned on small datasets. These limitations helped motivate RoBERTa, ALBERT, Longformer, and many other models over the two years that followed.