Causal mask (autoregressive mask)
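A (T × T) boolean mask applied in the decoder's self-attention, where T is the sequence length. The blocked entries form the strictly upper triangle: position i cannot attend to any position j > i (a future position), but it can still attend to itself and to earlier positions. The mask is needed during training, where the whole sequence is processed in parallel, so that the model learns to predict each token from past context only. At inference, generating left-to-right satisfies the constraint automatically, because future tokens do not yet exist when each token is produced.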
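
As an illustration, the mask and its effect on attention weights can be sketched in NumPy. This is a minimal sketch, not any particular library's implementation: the function names causal_mask and masked_softmax are hypothetical, and it assumes the convention that True marks a blocked position and that blocked scores are set to negative infinity before the softmax.

```python
import numpy as np

def causal_mask(T):
    # Hypothetical helper. True marks a blocked (future) position j > i,
    # i.e. the strictly upper triangle of a T x T matrix.
    return np.triu(np.ones((T, T), dtype=bool), k=1)

def masked_softmax(scores, mask):
    # Assumed convention: blocked scores become -inf, so they receive
    # exactly zero weight after the softmax.
    scores = np.where(mask, -np.inf, scores)
    # Subtract the row maximum for numerical stability. Each row keeps at
    # least its diagonal entry, so the maximum is always finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

T = 4
mask = causal_mask(T)
# Uniform raw scores make the effect of the mask easy to read off.
weights = masked_softmax(np.zeros((T, T)), mask)
print(mask.astype(int))
print(weights)
```

With uniform scores, row i of the result spreads its weight evenly over positions 0 through i and puts zero weight on every later position, which is the "past context only" behaviour described above.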