Attention Mechanism

Appears in 3 papers

The key innovation in Transformers that allows each token to consider the relevance of all other tokens in the sequence.

As used in Paper 12 — Language Models are Few-Shot Learners →

The key innovation in Transformers that allows each token to consider the relevance of all other tokens in the sequence. Self-attention weights represent how much each token should "attend to" each other token. With 96 heads and 96 layers, GPT-3 has many parallel attention operations.

As used in Paper 13 — Scaling Laws for Neural Language Models →

The core operation in Transformers. Each token computes a weighted average of all other tokens' representations, learning which tokens to focus on. The weights are learned during training.

As used in Paper 20 — Gemini: A Family of Highly Capable Multimodal Models →

The core component of a Transformer where each token learns to focus on other tokens. Computed as: softmax(Q·K^T / √d_head)·V. In Gemini, attention is applied uniformly to image patches and text tokens.