Attention Complexity
The computational cost of attention, typically measured in FLOPs (floating point operations).
The computational cost of attention, typically measured in FLOPs (floating point operations). Standard attention is O(n² × d) where n is sequence length and d is embedding dimension. SWA reduces this to O(n × W × d). This is why SWA makes long-context generation practical.