Attention Mechanism
The key innovation in Transformers that allows each token to consider the relevance of all other tokens in the sequence.
The key innovation in Transformers that allows each token to consider the relevance of all other tokens in the sequence. Self-attention weights represent how much each token should "attend to" each other token. With 96 heads and 96 layers, GPT-3 has many parallel attention operations.
The core operation in Transformers. Each token computes a weighted average of all other tokens' representations, learning which tokens to focus on. The weights are learned during training.
The core component of a Transformer where each token learns to focus on other tokens. Computed as: softmax(Q·K^T / √d_head)·V. In Gemini, attention is applied uniformly to image patches and text tokens.