Causal (masked) self-attention

Appears in 2 papers

Self-attention where position i is prevented from attending to positions j > i (future tokens).

As used in Paper 10 — Improving Language Understanding by Generative Pre-Training →

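Self-attention where position i is prevented from attending to positions j > i (future tokens). Enforced by setting the attention logits for every j > i to −∞ before the softmax, so those positions receive exactly zero weight. Necessary for the autoregressive training objective, in which each token is predicted from the preceding tokens only.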
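
The masking step described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the implementation from the paper: the function name `causal_attention_weights` and the example `scores` matrix are assumptions made here, and the sketch covers a single attention head with no scaling or value projection.

```python
import math

def causal_attention_weights(logits):
    """Mask a square matrix of attention logits causally, then softmax each row.

    logits[i][j] is the raw score for query position i attending to key position j.
    (Hypothetical helper for illustration only.)
    """
    n = len(logits)
    weights = []
    for i in range(n):
        # Positions j > i are future tokens: replace their logits with -inf.
        masked = [logits[i][j] if j <= i else -math.inf for j in range(n)]
        # Numerically stable softmax; exp(-inf) evaluates to 0.0.
        row_max = max(masked)  # finite, because j == i is never masked
        exps = [math.exp(x - row_max) for x in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Made-up scores for a 3-token sequence.
scores = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 3.0],
          [2.0, 1.0, 0.0]]
weights = causal_attention_weights(scores)
```

Each row of `weights` still sums to 1, but every entry above the diagonal is zero, so position i draws only on positions 0 through i.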

As used in Paper 11 — BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding →

Describes a language model that generates tokens one at a time, conditioning each token only on the tokens generated so far. GPT-1 is causal. BERT is not: it attends bidirectionally, so it cannot generate text autoregressively.
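
Token-by-token generation can be sketched as a simple loop. The names `generate` and `toy_next_token` are hypothetical, and the toy rule (previous token plus one) merely stands in for a trained model's prediction; the point is only that each new token is computed from the prefix produced so far.

```python
def generate(next_token, prompt, max_new_tokens):
    """Autoregressive loop: each new token depends only on the tokens so far."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # The model sees the current prefix only, never any future token.
        tokens.append(next_token(tokens))
    return tokens

def toy_next_token(prefix):
    # Hypothetical stand-in for a trained model: predicts last token + 1.
    return prefix[-1] + 1

out = generate(toy_next_token, [1, 2], 3)
```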