Multi-head attention (MHA)

Appears in 2 papers

`Concat(head₁, ..., headₕ) · W^O`.

As used in Paper 08 — Attention Is All You Need →

MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) · W^O, where headᵢ = Attention(Q·Wᵢ^Q, K·Wᵢ^K, V·Wᵢ^V). Running h attention operations in parallel lets the model attend to h different types of relationships simultaneously, which is critical to the Transformer's expressiveness.
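The formula can be made concrete with a small NumPy sketch. It is only an illustration under simplifying assumptions: a single unbatched sequence, self-attention (Q, K and V all come from the same input `x`), no masking, no biases, and random weights. The names `multi_head_attention`, `w_q`, `w_k`, `w_v` and `w_o` are chosen here for illustration and do not come from the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """Self-attention over x of shape (seq_len, d_model), split across n_heads."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Scaled dot-product attention, computed independently for each head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (n_heads, seq_len, d_head)

    # Concat(head_1, ..., head_h) followed by the output projection W^O.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

# Illustrative sizes only (assumed, not taken from either paper).
rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads)
```

Here the per-head projections are stored as column slices of one `d_model × d_model` matrix per Q, K and V, a common layout that is equivalent to keeping h separate smaller matrices.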

As used in Paper 18 — Mistral 7B →

The standard attention mechanism in Transformers, where the embedding is split into n_heads independent "attention heads," each computing attention separately. Each head has its own query (Q), key (K), and value (V) projections. Results from all heads are concatenated and projected back to the model dimension. MHA is expressive but memory-expensive: during autoregressive inference, the cached keys and values must be stored separately for each head.
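To give a rough sense of the memory cost, the sketch below estimates the size of the key/value cache under MHA. The helper `kv_cache_bytes` and every number passed to it are assumptions chosen for illustration (16-bit values, batch size 1); they are not figures quoted from the paper.

```python
def kv_cache_bytes(n_layers, n_heads, d_head, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV cache size when every head stores its own K and V."""
    # Factor of 2: one key vector and one value vector per head, per layer, per token.
    per_token = 2 * n_layers * n_heads * d_head * bytes_per_value
    return per_token * seq_len * batch_size

# Hypothetical configuration, for illustration only.
total = kv_cache_bytes(n_layers=32, n_heads=32, d_head=128, seq_len=4096)
print(f"{total / 2**30:.1f} GiB")  # prints "2.0 GiB"
```

The cost grows linearly with the number of heads, the sequence length, and the batch size, which is why per-head KV storage becomes a bottleneck for long contexts.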