Scaled dot-product attention

Appears in 2 papers

`Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V`.

As used in Paper 08 — Attention Is All You Need →

Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V. The fundamental operation of the Transformer. "Scaled" refers to the √dₖ divisor. "Dot-product" refers to Q·Kᵀ (matrix of all pairwise dot products). This is faster than Bahdanau's additive attention and maps directly to efficient GPU matrix operations.

As used in Paper 18 — Mistral 7B →

The core attention mechanism: Scores = Q @ K^T / √d_head, then softmax(Scores) @ V. This computes which tokens are relevant to the current token. Scaling by 1/√d_head stabilises training. Standard in all Transformer variants including Mistral.