Sliding Window Attention (SWA)
An attention variant where each token attends only to the last W tokens (a sliding window), not all previous tokens.
An attention variant where each token attends only to the last W tokens (a sliding window), not all previous tokens. Mistral uses W = 4,096. Reduces attention computation from O(n²) to O(n×W). Information from earlier tokens still reaches the current token indirectly through multiple layers, preserving long-range dependencies while improving efficiency.