Attention Complexity

Appears in 1 paper

The computational cost of attention, typically measured in FLOPs (floating point operations).

As used in Paper 18 — Mistral 7B →

The computational cost of attention, typically measured in FLOPs (floating point operations). Standard attention is O(n² × d) where n is sequence length and d is embedding dimension. SWA reduces this to O(n × W × d). This is why SWA makes long-context generation practical.