Sliding Window Attention (SWA)

Appears in 1 paper


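An attention variant in which each token attends only to the most recent W tokens (a sliding window) rather than to all previous tokens. Mistral 7B uses W = 4,096. For a sequence of n tokens, this reduces the cost of attention from O(n²) to O(n×W). Information from tokens outside the window can still reach the current token indirectly: each layer extends the reach by up to W positions, so after k layers a token can be influenced by tokens roughly k×W positions back. With Mistral 7B's 32 layers, that is a theoretical span of about 131K tokens, which lets the model capture long-range dependencies while keeping attention efficient.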
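The windowing rule and the O(n×W) count can be illustrated with a small sketch. This is not Mistral's implementation. The function names are hypothetical, and the code assumes the window of W tokens includes the current token, so token i attends to positions i − W + 1 through i.

```python
def sliding_window_mask(n, window):
    """Return an n x n boolean mask; mask[i][j] is True if token i may attend to token j.

    Assumption: the window of `window` tokens includes the current token,
    so token i sees positions j with i - window < j <= i.
    """
    return [[(i - window < j <= i) for j in range(n)] for i in range(n)]


def attended_pairs(n, window):
    """Count (query, key) pairs that are actually computed under the window."""
    return sum(sum(row) for row in sliding_window_mask(n, window))


def full_causal_pairs(n):
    """Count (query, key) pairs under ordinary causal attention."""
    return n * (n + 1) // 2


# Small demonstration: 6 tokens, window of 3.
for row in sliding_window_mask(6, 3):
    print("".join("x" if allowed else "." for allowed in row))

print(attended_pairs(6, 3), "pairs with the window,", full_causal_pairs(6), "without")
```

Each row of the printed mask contains at most W marked positions, so the number of computed pairs grows linearly in n for a fixed W, whereas the full causal count grows quadratically.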

Paper 18: Mistral 7B
