Causal mask (autoregressive mask)

Appears in 1 paper

A (T × T) boolean mask applied in the decoder's self-attention that blocks the entries strictly above the diagonal (future positions).

As used in Paper 08 — Attention Is All You Need →

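A (T × T) boolean mask applied in the decoder's self-attention, where T is the sequence length. The blocked entries are those strictly above the diagonal: position i cannot attend to any position j > i (future positions), but it can still attend to itself and to earlier positions. In the paper, blocked attention scores are set to −∞ before the softmax, so they receive zero weight. The mask is needed during training, when the whole target sequence is processed in parallel, so that the model learns to predict each token using only past context. At inference the constraint is satisfied automatically by left-to-right generation, because future tokens do not exist yet.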
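The following is a minimal sketch of how such a mask can be built and applied, written in plain Python for illustration rather than taken from the paper's code. The names `causal_mask` and `masked_softmax` are hypothetical, and the convention that True means "blocked" is an assumption; some libraries use the opposite convention.

```python
import math

def causal_mask(T):
    """T x T boolean mask; True marks a blocked (future) position j > i."""
    return [[j > i for j in range(T)] for i in range(T)]

def masked_softmax(scores, mask):
    """Row-wise softmax; blocked entries are set to -inf before normalizing."""
    out = []
    for row, mask_row in zip(scores, mask):
        masked = [-math.inf if blocked else s for s, blocked in zip(row, mask_row)]
        peak = max(masked)  # always finite: the diagonal is never blocked
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) == 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

T = 4
mask = causal_mask(T)
scores = [[0.0] * T for _ in range(T)]  # uniform scores, for illustration only
weights = masked_softmax(scores, mask)
```

With uniform scores, row i of `weights` spreads its attention evenly over positions 0 through i and gives zero weight to every later position.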
