Causal Masking

Appears in 2 papers

In autoregressive language modelling, the restriction that prevents the model from attending to future tokens (tokens that come after the current position).

As used in Paper 18 — Mistral 7B →

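In autoregressive language modelling, preventing the model from attending to future tokens (tokens that come after the current position). It is implemented by setting the attention scores for future positions to -∞ before the softmax, so those positions receive zero attention weight. This keeps training consistent with generation, where future tokens do not yet exist, and stops the model from cheating by reading ahead. Mistral combines causal masking with sliding window masking, so each token attends only to a fixed-size window of preceding tokens.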
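
The masking step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Mistral's implementation: the function names `causal_mask` and `masked_softmax`, the `window` parameter, and the convention that a window of size w covers the current token plus the w - 1 tokens before it are all assumptions made for the example.

```python
import math

NEG_INF = float("-inf")

def causal_mask(seq_len, window=None):
    """Return a seq_len x seq_len boolean matrix; True means query i may attend to key j."""
    allowed = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            ok = j <= i  # causal: never look at positions after i
            if window is not None:
                # Assumed sliding-window convention: keep only the last `window` positions.
                ok = ok and (i - j) < window
            row.append(ok)
        allowed.append(row)
    return allowed

def masked_softmax(scores, allowed):
    """Row-wise softmax after setting disallowed scores to -inf."""
    out = []
    for score_row, allowed_row in zip(scores, allowed):
        masked = [s if ok else NEG_INF for s, ok in zip(score_row, allowed_row)]
        row_max = max(masked)  # finite, because the diagonal is always allowed
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Uniform raw scores make the effect of the mask easy to see.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
windowed = masked_softmax(scores, causal_mask(4, window=2))
```

With uniform scores, the first row of `weights` puts all of its weight on position 0, while the last row spreads it evenly over all four positions. In `windowed`, the last row spreads its weight over only the two most recent positions.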

As used in Paper 19 — Ring Attention with Blockwise Transformers for Near-Infinite Context →

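In autoregressive language generation, preventing attention to future tokens: token t cannot attend to tokens beyond position t. Ring Attention requires careful masking to enforce causality as KV chunks circulate across GPUs, because each GPU holds only one block of queries and must mask each incoming KV chunk according to the global positions of its tokens, not their positions within the chunk.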
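
A minimal sketch of blockwise causal masking, in plain Python, shows why global positions matter. It is not the Ring Attention implementation: the helper `block_mask`, the block size, and the rotation schedule `(device - step) % n_dev` are hypothetical choices for illustration. The underlying idea is that a KV block earlier than the query block is fully visible, the matching block needs the usual triangular mask, and a later block is fully masked.

```python
def block_mask(q_block, kv_block, block_size):
    """Mask for one (query block, KV block) pair; True means attention is allowed.

    Positions are converted to global indices before comparing, so the same
    rule (key position <= query position) works for every pair of blocks.
    """
    q_start = q_block * block_size
    kv_start = kv_block * block_size
    return [
        [kv_start + j <= q_start + i for j in range(block_size)]
        for i in range(block_size)
    ]

n_dev, bs = 3, 2          # assumed: 3 devices, 2 tokens per block
seq_len = n_dev * bs

# Simulate the ring: device d keeps query block d; at each step it sees a
# different KV block (the rotation direction here is an assumption).
assembled = [[None] * seq_len for _ in range(seq_len)]
for step in range(n_dev):
    for device in range(n_dev):
        kv_block = (device - step) % n_dev
        mask = block_mask(device, kv_block, bs)
        for i in range(bs):
            for j in range(bs):
                assembled[device * bs + i][kv_block * bs + j] = mask[i][j]

# Reference: the ordinary causal mask over the whole sequence.
global_causal = [[j <= i for j in range(seq_len)] for i in range(seq_len)]
```

After all steps, `assembled` matches `global_causal`, so the per-block masks together reproduce ordinary causal masking over the full sequence.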