Causal (masked) self-attention
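Self-attention in which position i is prevented from attending to positions j > i (future tokens). Enforced by setting the attention logits for those positions to −∞ before the softmax, so they receive zero attention weight. Necessary for the autoregressive training objective: the prediction at each position must depend only on earlier tokens.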
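As a rough illustration of the masking step, here is a minimal sketch in plain Python. It assumes the attention logits are already available as a square list of lists, indexed as logits[i][j] for query position i and key position j. The helper names (causal_mask_logits, softmax, causal_attention_weights) and the toy values are hypothetical, chosen for illustration only, and do not come from any particular library.

```python
import math

NEG_INF = float("-inf")

def causal_mask_logits(logits):
    """Return a copy of a square logit matrix with every entry j > i set to -inf."""
    n = len(logits)
    return [
        [logits[i][j] if j <= i else NEG_INF for j in range(n)]
        for i in range(n)
    ]

def softmax(row):
    """Numerically stable softmax over one row; exp(-inf) evaluates to 0.0."""
    m = max(row)  # finite, because the diagonal entry j == i is never masked
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(logits):
    """Mask future positions, then normalize each row with softmax."""
    return [softmax(row) for row in causal_mask_logits(logits)]

# Toy 3x3 logits (illustrative values only).
logits = [
    [0.0, 1.0, 2.0],
    [1.0, 1.0, 5.0],
    [0.5, 0.5, 0.5],
]
weights = causal_attention_weights(logits)
# Row 0 can only attend to position 0; row 1 splits evenly over positions 0 and 1.
```

Because the mask is applied before the softmax, each row still sums to 1 over the positions it is allowed to see, and the large logit at position (1, 2) has no effect on the result.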
A language model that generates tokens one at a time, conditioning each token only on the tokens generated so far. GPT-1 is causal. BERT is not: it attends bidirectionally and is trained to fill in masked tokens, so it cannot generate text autoregressively.
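The one-token-at-a-time loop can be sketched as follows. This is only an illustration: toy_next_token is a hypothetical stand-in for a trained model (a real model would produce a distribution over a vocabulary and sample or take the argmax), and generate is an assumed helper name, not an API from any library.

```python
def toy_next_token(prefix):
    """Hypothetical stand-in for a model: it sees only the tokens so far."""
    return (sum(prefix) + 1) % 5

def generate(prompt, num_new_tokens, next_token=toy_next_token):
    """Append tokens one at a time, each computed from the current prefix only."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(next_token(tokens))
    return tokens

generated = generate([1, 2], 3)
```

Each call to next_token receives only the prefix built up so far, which is the property that causal masking preserves during training.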