Multi-head attention (MHA)
`Concat(head₁, ..., headₕ) · W^O`.
Running h attention operations in parallel lets each head learn to attend to a different type of relationship at the same time, which is critical to the Transformer's expressiveness.
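As a rough illustration of the formula, the sketch below runs h scaled dot-product attention heads in parallel with NumPy, concatenates them, and applies the output projection W^O. The function name `multi_head_attention`, the single-sequence (unbatched, unmasked) setup, and the random weights are assumptions made for the example, not part of any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, h):
    """Hypothetical helper. x: (seq_len, d_model); W_*: (d_model, d_model)."""
    seq_len, d_model = x.shape
    assert d_model % h == 0, "d_model must be divisible by the number of heads"
    d_head = d_model // h

    def split_heads(t):
        # (seq_len, d_model) -> (h, seq_len, d_head)
        return t.reshape(seq_len, h, d_head).transpose(1, 0, 2)

    q = split_heads(x @ W_q)
    k = split_heads(x @ W_k)
    v = split_heads(x @ W_v)

    # Scaled dot-product attention, computed for all h heads at once.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (h, seq_len, d_head)

    # Concat(head_1, ..., head_h) followed by the output projection W^O.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out, weights = multi_head_attention(x, W_q, W_k, W_v, W_o, h)
print(out.shape, weights.shape)  # (5, 16) (4, 5, 5)
```

Slicing one large projection into h pieces, as done here, is equivalent to giving each head its own smaller Q, K, and V projection matrices.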
The standard attention mechanism in Transformers. The embedding dimension is split across n_heads attention heads (the h in the formula above), each with its own query (Q), key (K), and value (V) projections and each computing attention independently. The outputs of all heads are concatenated and projected back to the embedding dimension. MHA is expressive but memory-expensive at inference time, because the keys and values of every head must be stored for every token (the KV cache).
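To make the memory cost concrete, the following back-of-the-envelope sketch estimates how much storage the keys and values need when every head keeps its own K and V. The helper name `kv_cache_bytes` and the model configuration (32 layers, 32 heads, head size 128, 16-bit values) are hypothetical numbers chosen for illustration, not those of a specific model.

```python
def kv_cache_bytes(n_layers, n_heads, d_head, seq_len, batch_size=1, bytes_per_elem=2):
    """Hypothetical helper: bytes needed to store K and V for every head and layer."""
    # The leading 2 counts one K tensor and one V tensor per head.
    return 2 * n_layers * n_heads * d_head * seq_len * batch_size * bytes_per_elem

# Assumed configuration, for illustration only.
per_token = kv_cache_bytes(n_layers=32, n_heads=32, d_head=128, seq_len=1)
total = kv_cache_bytes(n_layers=32, n_heads=32, d_head=128, seq_len=4096)

print(per_token // 1024, "KiB per token")        # 512 KiB per token
print(total / 2**30, "GiB for a 4096-token sequence")  # 2.0 GiB for a 4096-token sequence
```

Because the cost scales linearly with n_heads, sequence length, and batch size, long contexts or large batches multiply this figure quickly.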