8. Impact — How BERT Changed NLP, Search, and the Encoder-Decoder World
BERT was not just a research result — it was the model that brought transformer-based pre-training into production at industrial scale, almost overnight.
Immediate benchmark domination
When the BERT paper appeared on arXiv in October 2018, it reported new state-of-the-art results on 11 NLP tasks at once. On GLUE, a suite of nine natural language understanding tasks including sentiment analysis, natural language inference, and question-answer matching, BERT-large raised the leaderboard score from a previous best of 72.8 to 80.5. On SQuAD v1.1 (Stanford Question Answering Dataset), one of the most widely used reading comprehension benchmarks at the time, BERT exceeded the published human baseline on both exact match and F1. The research community had seen impressive results before, but this breadth of simultaneous improvement, from one pre-trained model fine-tuned across such diverse tasks, was unprecedented.
Google Search
In October 2019, Google announced that BERT had been deployed in Google Search, initially for complex, conversational queries in US English. For queries like “Can you get medicine for someone pharmacy?”, where small words such as “for” and the order of the words carry the meaning, BERT’s bidirectional context improved the relevance of results compared to keyword-based or bag-of-words approaches.
This was one of the largest deployments of a research language model up to that point. At launch, Google said BERT affected about one in ten English-language searches in the US; by October 2020 it reported that BERT was used on almost every English query. The model that started as a research paper in 2018 was, within about two years, shaping what billions of people saw when they searched the internet.
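The point about word order can be made concrete with a minimal sketch. The two queries below are invented for illustration (they are not taken from Google’s announcement), and `bag_of_words` is a hypothetical helper rather than anything a real search stack uses. It shows why a pure bag-of-words representation cannot tell apart two queries that contain the same words in a different order.

```python
from collections import Counter


def bag_of_words(query: str) -> Counter:
    """Lowercase, split on whitespace, and count tokens.

    Word order is discarded: only the multiset of tokens survives.
    """
    return Counter(query.lower().split())


# Hypothetical queries: same words, different meaning.
query_a = "can you get medicine for someone at the pharmacy"
query_b = "can someone get medicine for you at the pharmacy"

# The bags are identical even though the token sequences differ.
same_bag = bag_of_words(query_a) == bag_of_words(query_b)
same_order = query_a.split() == query_b.split()

print(same_bag)    # True
print(same_order)  # False
```

A model that reads the full sequence in both directions, as BERT does, sees different inputs for the two queries, so it at least has the information needed to rank results differently.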
The BERT family
BERT’s release triggered an immediate wave of derivative models, each addressing one of its limitations:
RoBERTa (Liu et al., 2019, Facebook AI): Retrained BERT without NSP, on roughly 10× more data (about 160 GB of text versus 16 GB), with larger batches, dynamic masking, and longer training. It outperformed BERT-large on GLUE, SQuAD, and RACE, suggesting that NSP was unnecessary and that more data and compute mattered more than the second pre-training objective.
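ALBERT (Lan et al., 2019, Google): Reduced BERT’s parameter count by factorising the token embedding matrix and sharing weights across Transformer layers. ALBERT-xxlarge had fewer parameters than BERT-large (about 235M versus 334M) but matched or exceeded its performance, showing that raw parameter count was not the only axis that mattered. Weight sharing shrinks the model on disk and in memory but does not reduce the compute of a forward pass, since every layer still has to run.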
DistilBERT (Sanh et al., 2019, Hugging Face): Applied knowledge distillation to compress BERT-base into a model 40% smaller and 60% faster, retaining 97% of its language understanding performance. DistilBERT made BERT-style models far more practical in memory- and latency-constrained settings such as mobile and edge devices.
Multilingual BERT (mBERT): Google released a single BERT model pre-trained on Wikipedia text in 104 languages simultaneously. By sharing parameters and a vocabulary across languages, the model learned cross-lingual representations that transfer between languages: fine-tune on English NER data and it works reasonably well on Hindi or French NER.
BioBERT, Legal-BERT, SciBERT: Domain-specific versions of BERT pre-trained on biomedical literature, legal documents, and scientific papers respectively. Because the vocabulary and writing style of these domains differ significantly from Wikipedia and books, domain-specific pre-training improved performance on in-domain tasks such as biomedical named entity recognition.
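ALBERT’s embedding factorisation, mentioned in the list above, is easy to check with arithmetic. The sketch below assumes illustrative round numbers (a 30,000-token vocabulary, hidden size 1024, embedding size 128), not an exact published configuration, and ignores bias terms. `embedding_params` is a hypothetical helper written for this example.

```python
from typing import Optional


def embedding_params(vocab_size: int, hidden_size: int,
                     embedding_size: Optional[int] = None) -> int:
    """Count weights in the token embedding.

    Without factorisation the table is V x H.
    With factorisation it is a V x E table followed by an E x H projection.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


# Illustrative sizes only (assumptions, not an exact model config).
VOCAB, HIDDEN, EMBED = 30_000, 1024, 128

full = embedding_params(VOCAB, HIDDEN)               # 30,000 * 1024
factorised = embedding_params(VOCAB, HIDDEN, EMBED)  # 30,000 * 128 + 128 * 1024

print(full)        # 30720000
print(factorised)  # 3971072
```

Under these assumptions the embedding shrinks by a bit under 8×. The saving grows with the hidden size, because only the small E x H projection depends on it.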
The encoder-decoder synthesis
BERT’s success with encoders and GPT’s success with decoders led naturally to the question: what happens if you combine both?
T5 (Raffel et al., 2019, Google) framed every NLP task as a text-to-text problem and pre-trained an encoder-decoder Transformer on C4, a large cleaned web crawl corpus, using a span-corruption objective closely related to BERT’s masked language modelling. T5 demonstrated that a unified architecture could handle classification, generation, translation, and summarisation with the same model, by casting every task as: given input text, produce output text.
This synthesis, BERT’s pre-training philosophy applied to an encoder-decoder architecture, fed directly into early instruction-following models: T0 and Flan-T5 are fine-tuned T5 variants. Other instruction-tuned models, such as the original FLAN, the InstructGPT fine-tunes of GPT-3, and Alpaca, are decoder-only, but they share the same framing of every task as text in, text out.
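The text-to-text framing described above can be sketched in a few lines. The task prefixes below follow the style of the examples in the T5 paper, but the exact strings, the `TEMPLATES` table, and the `make_example` helper are assumptions made for illustration, not the T5 codebase’s API. The point is only that classification, translation, and summarisation all reduce to the same (input string, target string) shape.

```python
# Hypothetical prefix templates in the style of the T5 paper's examples.
TEMPLATES = {
    "translate_en_de": "translate English to German: {text}",
    "summarize": "summarize: {text}",
    "sentiment": "sst2 sentence: {text}",
}

# Class labels are rendered as words so the target is also plain text.
SENTIMENT_LABELS = {0: "negative", 1: "positive"}


def make_example(task: str, text: str, target) -> tuple:
    """Return an (input_text, target_text) pair for any supported task."""
    if task not in TEMPLATES:
        raise ValueError(f"unknown task: {task}")
    if task == "sentiment":
        target = SENTIMENT_LABELS[target]
    return TEMPLATES[task].format(text=text), str(target)


examples = [
    make_example("sentiment", "a fine film", 1),
    make_example("translate_en_de", "That is good.", "Das ist gut."),
]
for source, target in examples:
    print(repr(source), "->", repr(target))
```

Because every example has the same shape, one model, one loss, and one decoding procedure can serve all tasks; only the prefix tells the model which task it is being asked to perform.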
The lasting divide
Despite the many successors, BERT’s core insight remains central to language understanding tasks in production today. When you search Google, when a bank runs sentiment analysis on customer feedback, when a hospital system extracts diagnoses from clinical notes, when an e-commerce site matches product queries to listings, the underlying model is very often a BERT variant or another encoder trained with BERT’s bidirectional pre-training philosophy.
GPT and its descendants dominate generation. BERT and its descendants dominate understanding. The 2018 paper that introduced masked language modelling drew the line between these two families, and that line has not moved.