Section 02

The Problem: Why left-to-right context is not enough for understanding

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding 2018

Language understanding is fundamentally a bidirectional activity.

When you read a sentence, you do not process it strictly word by word, left to right, committing to the meaning of each word before seeing the next. You read the sentence, let your eye skip ahead, revise earlier interpretations, and eventually settle on a coherent understanding of the whole. The meaning of any word in a sentence depends on the words around it — both to the left and to the right.

This is not a subtle point. Consider three sentences:

  1. “I went to the bank to deposit money.”
  2. “The bank of the river was muddy after the flood.”
  3. “She sat on the bank, watching the fish jump.”

The word “bank” does not mean the same thing across these sentences: in the first it is a financial institution, and in the second and third it is the sloping ground at the edge of a river. A left-to-right model processing the word “bank” has seen only the prefix in each case: “I went to the”, “The”, and “She sat on the”. None of these prefixes contains much disambiguating information. The words “deposit,” “river,” and “fish”, which clearly signal the correct meaning, all appear later in the sentence.

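A causal language model must therefore build its representation of “bank” from an almost uninformative prefix. The disambiguating words arrive later, but under a causal mask the “bank” position can never attend to them, so its representation is fixed before the evidence appears. Later positions can see both “bank” and the disambiguating words, but the representation at “bank” itself cannot be revised.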
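
The effect of the causal mask can be made concrete with a minimal sketch. It assumes whitespace tokenization and represents attention masks as plain boolean matrices; the helper names (causal_mask, bidirectional_mask, visible_tokens) are hypothetical, chosen only for illustration, and not taken from any library.

```python
def causal_mask(n):
    # allowed[i][j] is True when position i may attend to position j.
    # Causal: a position sees itself and everything to its left.
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    # Bidirectional: every position sees every other position.
    return [[True] * n for _ in range(n)]


def visible_tokens(tokens, allowed, position):
    # Tokens that the given position is allowed to attend to.
    return [tok for j, tok in enumerate(tokens) if allowed[position][j]]


# Assumption: simple whitespace tokenization, punctuation dropped.
tokens = "The bank of the river was muddy after the flood".split()
bank = tokens.index("bank")

causal_view = visible_tokens(tokens, causal_mask(len(tokens)), bank)
bidirectional_view = visible_tokens(tokens, bidirectional_mask(len(tokens)), bank)

print(causal_view)         # ['The', 'bank']
print(bidirectional_view)  # all ten tokens, including 'river'
```

Under the causal mask, the only context available at the “bank” position is “The”; the word “river” is visible only under the bidirectional mask.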

This is not an edge case; it is the norm. Language is full of ambiguous words, ellipsis, forward references (cataphora, as in “Before she left, Maria locked the door”), and garden-path sentences that can only be parsed correctly with right context.


Why this hurts specific tasks

Reading Comprehension (e.g. SQuAD): The question “Where did the child hide?” asks you to find the answer in a passage. To locate the answer span, the model must deeply understand the relationship between the question and every sentence in the passage. That understanding requires representing each passage word in the context of all surrounding words — including those that appear after it.

Named Entity Recognition: To tag “Delhi” as a city or as part of an organization name (“Delhi Capitals”), the model needs the word that comes after it. A left-to-right model must produce its representation of “Delhi” before it has seen “Capitals”, and it cannot go back and revise that representation.

Textual Entailment: Does “A man is playing guitar” entail “A person is playing a musical instrument”? Answering requires relating words across the two sentences (“man” to “person”, “guitar” to “musical instrument”), and that is easiest when every word is represented in the context of both sentences in full.


Why you cannot just train a bidirectional autoregressive model

The obvious fix is: remove the causal mask, let every token attend to every other token, and keep predicting the next token. This fails immediately.

Next-token prediction asks position 6 to predict token 7. If position 6 can attend to token 7, the task is trivially solved by copying. The model learns nothing about language: it just attends to the token it is supposed to predict and echoes it back. Removing the causal mask while keeping the next-token objective does not produce a bidirectional language model. It produces a model that copies its input.
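
The leak can be illustrated with a toy sketch. It is not a real Transformer: copy_predictor is a hypothetical, parameter-free stand-in for a model that has found the shortcut, and the masks are plain boolean matrices. The point is only that, without the causal mask, a “model” with nothing learned can score perfectly on next-token prediction.

```python
def causal_mask(n):
    # allowed[i][j] is True when position i may attend to position j.
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n):
    # No masking: every position may attend to every other position.
    return [[True] * n for _ in range(n)]


def copy_predictor(tokens, allowed):
    """Parameter-free 'model': at position i, copy token i + 1 if it is visible."""
    predictions = []
    for i in range(len(tokens) - 1):
        predictions.append(tokens[i + 1] if allowed[i][i + 1] else None)
    return predictions


def accuracy(predictions, targets):
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Assumption: simple whitespace tokenization.
tokens = "I went to the bank to deposit money".split()
targets = tokens[1:]  # next-token targets: position i must predict token i + 1

leaky_accuracy = accuracy(copy_predictor(tokens, full_mask(len(tokens))), targets)
causal_accuracy = accuracy(copy_predictor(tokens, causal_mask(len(tokens))), targets)

print(leaky_accuracy)   # 1.0
print(causal_accuracy)  # 0.0
```

With the full mask the copying shortcut is always available, so the training loss can be driven to zero without the model encoding anything about language. With the causal mask the shortcut does not exist, and the model has to predict from context.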

You need a different training objective — one that requires the model to use bidirectional context without trivially seeing the answer. BERT’s solution is the topic of the next section.