Multi-Query Attention (MQA)

Appears in 1 paper

An attention variant where all query heads share a single key-value head.

As used in Paper 18 — Mistral 7B →

An attention variant where all query heads share a single key-value head. Dramatically reduces KV cache (n_heads × reduction), but hurts model quality because all queries are forced to use identical keys and values. Mistral's GQA is the middle ground between MHA (full expressiveness, high memory) and MQA (low memory, low quality).