The Math: Tokenisation, Embeddings, and Attention
Prerequisites: Matrix Multiplication and Projections, Softmax and Attention
Part 1: Image Tokenisation
Step 1: Patch Division
An image I ∈ ℝ^(H × W × C) (height × width × channels) is divided into non-overlapping patches.
For a standard 224×224 RGB image with patch size 14×14:
Number of patches = (H / patch_size) × (W / patch_size)
= (224 / 14) × (224 / 14)
= 16 × 16
= 256 patches
Each patch P_ij is a 14×14×3 = 588-dimensional vector (RGB values flattened).
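The patch division can be sketched in plain Python. This is a minimal illustration only: the helper name patchify and the nested-list image layout are assumptions made for this example, not part of any Gemini code, and a real pipeline would typically use an array library.

```python
def patchify(image, patch_size):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # append this pixel's channel values
            patches.append(flat)
    return patches


# Dummy 224 x 224 RGB image in which every pixel is [0, 0, 0].
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
patches = patchify(image, 14)
print(len(patches), len(patches[0]))  # 256 588
```

Patches come out in row-major order (left to right, then top to bottom), which is one common convention; the source does not say which order Gemini uses.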
Step 2: Patch Embedding
Each patch is projected to d_model dimensions using a learned linear layer:
e_patch = W_patch @ flatten(P_ij) + b_patch
Where:
- W_patch ∈ ℝ^(d_model × 588) is the patch projection weight matrix
- b_patch ∈ ℝ^d_model is the learned bias vector
- e_patch ∈ ℝ^d_model is the embedded patch
For Gemini Ultra, this walkthrough assumes d_model ≈ 2048.
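The projection is an ordinary affine map. A minimal sketch follows, assuming random stand-ins for the learned weights, pixel values scaled to [0, 1], and d_model shrunk to 8 so that it runs instantly; the helper name linear is hypothetical.

```python
import random


def linear(weights, bias, x):
    """Compute weights @ x + bias for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


random.seed(0)
patch_dim, d_model = 588, 8  # d_model shrunk from 2048 for the demo

# Random values stand in for the learned parameters W_patch and b_patch.
W_patch = [[random.gauss(0.0, 0.02) for _ in range(patch_dim)] for _ in range(d_model)]
b_patch = [0.0] * d_model

# One flattened patch; scaling pixels to [0, 1] is an assumption of this sketch.
flat_patch = [random.randint(0, 255) / 255.0 for _ in range(patch_dim)]

e_patch = linear(W_patch, b_patch, flat_patch)
print(len(e_patch))  # 8
```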
Worked Numerical Example
Setup:
- Image size: 224×224 (standard)
- Patch size: 14×14
- d_model: 2048 (Gemini Ultra)
Calculation:
Number of patches:
patches = (224 / 14) × (224 / 14) = 16 × 16 = 256
Size of one patch vector:
patch_vector_size = 14 × 14 × 3 = 196 × 3 = 588
Embedding matrix dimensions:
W_patch ∈ ℝ^(2048 × 588) (projects 588-D patch → 2048-D embedding)
Example patch embedding (concrete numbers):
One 14×14 pixel patch in RGB:
P = [
[255, 128, 64], # top-left pixel (reddish)
[254, 127, 63], # next pixel
... 194 more pixels ...
]
Flattened: [255, 128, 64, 254, 127, 63, ..., 200, 100, 50] (588 values)
e_patch = W_patch @ patch + b_patch
∈ ℝ^2048
After processing all 256 patches, we have a sequence of 256 embeddings, each 2048-dimensional.
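The arithmetic in this worked example can be checked in a few lines. The parameter count at the end is an extra figure derived from the stated shapes, not a number quoted from the source.

```python
image_size, patch_size, channels, d_model = 224, 14, 3, 2048

num_patches = (image_size // patch_size) ** 2    # 16 x 16 grid of patches
patch_dim = patch_size * patch_size * channels   # flattened patch length
weight_count = d_model * patch_dim               # entries in W_patch
param_count = weight_count + d_model             # plus one bias per output dimension

print(num_patches, patch_dim)     # 256 588
print(weight_count, param_count)  # 1204224 1206272
```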
Part 2: Text Tokenisation
Text is tokenised with SentencePiece, a subword tokeniser, using a vocabulary of 256K tokens.
For example:
Text: "The cat sat on the mat"
Tokens: [The, cat, sat, on, the, mat] (6 tokens)
Embeddings:
e_text[0] = W_text[token_id("The")] ∈ ℝ^2048
e_text[1] = W_text[token_id("cat")] ∈ ℝ^2048
...
Where W_text ∈ ℝ^(256000 × 2048) is the token embedding matrix (256K vocabulary entries × 2048 dimensions). Looking up a token simply selects the row of W_text indexed by its token id.
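The embedding lookup is plain row selection. In this sketch the six-entry vocabulary, the token ids, and the whitespace split are all hypothetical stand-ins: a real SentencePiece model produces its own pieces and ids, and d_model is shrunk to 4.

```python
import random

random.seed(0)
d_model = 4  # shrunk from 2048 for the demo

# Hypothetical toy vocabulary; real SentencePiece pieces and ids will differ.
vocab = {"The": 0, "cat": 1, "sat": 2, "on": 3, "the": 4, "mat": 5}

# Random values stand in for the learned embedding matrix W_text.
W_text = [[random.gauss(0.0, 0.02) for _ in range(d_model)] for _ in vocab]

tokens = "The cat sat on the mat".split()  # whitespace split stands in for the tokeniser
token_ids = [vocab[t] for t in tokens]
e_text = [W_text[i] for i in token_ids]    # embedding lookup = row selection

print(token_ids)                    # [0, 1, 2, 3, 4, 5]
print(len(e_text), len(e_text[0]))  # 6 4
```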
Part 3: Token Concatenation
Now we have:
- 256 image patch embeddings (each ∈ ℝ^2048)
- 6 text token embeddings (each ∈ ℝ^2048)
These are concatenated into one sequence:
X = [e_patch[0], e_patch[1], ..., e_patch[255],
e_text[0], e_text[1], ..., e_text[5]]
X ∈ ℝ^(262 × 2048) (262 tokens total, each 2048-D)
The Transformer doesn’t care that tokens 0–255 came from an image and 256–261 came from text. They’re all just tokens.
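Concatenation is just list (or array) joining along the sequence axis. A minimal sketch with placeholder zero vectors, used here only to show the resulting shape:

```python
d_model = 2048

# Placeholder embeddings; only the shapes matter for this illustration.
patch_embeddings = [[0.0] * d_model for _ in range(256)]
text_embeddings = [[0.0] * d_model for _ in range(6)]

# Image tokens first, then text tokens, as in the sequence X above.
X = patch_embeddings + text_embeddings
print(len(X), len(X[0]))  # 262 2048
```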
Part 4: Positional Encoding
The Transformer needs to know the order of tokens. Gemini uses absolute positional encodings (like most Transformers):
pos_enc[i] = [
sin(i / 10000^(0/d_model)),
cos(i / 10000^(0/d_model)),
sin(i / 10000^(2/d_model)),
cos(i / 10000^(2/d_model)),
...
]
pos_enc[i] ∈ ℝ^d_model
Each token’s embedding is added to its positional encoding:
X_pos[i] = X[i] + pos_enc[i]
Worked Example (Smaller Dimension)
Let d_model = 4 (simplified; real: 2048). Compute the positional encodings for positions i=0 and i=1. With d_model = 4 the two frequency divisors are 10000^(0/4) = 1 and 10000^(2/4) = 100:
Position 0:
pos_enc[0] = [sin(0/1), cos(0/1), sin(0/100), cos(0/100)]
= [sin(0), cos(0), sin(0), cos(0)]
= [0, 1, 0, 1]
Position 1:
pos_enc[1] = [sin(1/1), cos(1/1), sin(1/100), cos(1/100)]
= [sin(1), cos(1), sin(0.01), cos(0.01)]
≈ [0.841, 0.540, 0.010, 1.000] (rounded to three decimals; cos(0.01) ≈ 0.99995)
So if token 0 (image patch) had embedding [0.5, 0.2, 0.3, 0.1], it becomes:
[0.5, 0.2, 0.3, 0.1] + [0, 1, 0, 1] = [0.5, 1.2, 0.3, 1.1]
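The worked example can be reproduced with a short function. This is a sketch of the standard sinusoidal scheme, in which each sin/cos pair shares one frequency; the function name positional_encoding is an assumption of this example, and d_model is assumed to be even.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal encoding: each (sin, cos) pair shares one frequency."""
    encoding = []
    for k in range(0, d_model, 2):
        angle = position / (10000 ** (k / d_model))
        encoding.extend([math.sin(angle), math.cos(angle)])
    return encoding


pe0 = positional_encoding(0, 4)
pe1 = positional_encoding(1, 4)
print([round(v, 3) for v in pe0])  # [0.0, 1.0, 0.0, 1.0]
print([round(v, 3) for v in pe1])  # [0.841, 0.54, 0.01, 1.0]

# Add the position-0 encoding to the example embedding of token 0.
embedding = [0.5, 0.2, 0.3, 0.1]
x_pos = [e + p for e, p in zip(embedding, pe0)]
print([round(v, 3) for v in x_pos])  # [0.5, 1.2, 0.3, 1.1]
```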
Part 5: Attention Mechanism
Attention is the core of the Transformer. It operates on a sequence of length n (262 in our example).
Query, Key, Value Projections
Each token embedding X_pos[i] is projected into three spaces:
Q[i] = W_Q @ X_pos[i] (Query, ∈ ℝ^d_head)
K[i] = W_K @ X_pos[i] (Key, ∈ ℝ^d_head)
V[i] = W_V @ X_pos[i] (Value, ∈ ℝ^d_head)
Where W_Q, W_K, W_V ∈ ℝ^(d_head × d_model)
Typically d_head = d_model / num_heads, e.g., 2048 / 32 = 64
Attention Weights
For each position i, compute its similarity to every position j, including itself:
scores[i, j] = Q[i] · K[j] / √d_head (dot product similarity)
For i=100 (an image patch), it attends to:
j=0–255 (image patches, including itself)
j=256-261 (text tokens)
All treated equally by the math.
Softmax Normalization
attn_weights[i, j] = exp(scores[i, j]) / Σ_k exp(scores[i, k])
For our example with 262 tokens:
Denominator = Σ_{k=0}^{261} exp(scores[i, k])
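A minimal softmax sketch follows. Subtracting the maximum score before exponentiating is a standard numerical-stability trick that leaves the result unchanged; it is an addition of this example, not something the formula above requires. The input scores are arbitrary illustration values.

```python
import math


def softmax(scores):
    """Normalise scores into weights that are positive and sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


weights = softmax([2.0, 1.0, 0.1])
print([round(w, 3) for w in weights])  # [0.659, 0.242, 0.099]

# Very large scores do not overflow thanks to the shift.
print(softmax([1000.0, 1000.0]))  # [0.5, 0.5]
```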
Weighted Sum
The output for position i is the weighted sum of all values:
attn_output[i] = Σ_j attn_weights[i, j] × V[j]
Worked Attention Example
Simplified: 3 tokens (image patch + 2 text words), d_head = 2.
Q[0] = [0.5, 0.3] (image patch query)
Q[1] = [0.2, 0.8] (text word "cat" query)
Q[2] = [0.7, 0.1] (text word "sat" query)
K[0] = [0.6, 0.4]
K[1] = [0.3, 0.9]
K[2] = [0.5, 0.2]
V[0] = [1.0, 0.0]
V[1] = [0.0, 1.0]
V[2] = [0.5, 0.5]
Compute attention for token 0 (image patch):
scores[0, 0] = Q[0] · K[0] / √2 = (0.5×0.6 + 0.3×0.4) / √2 = 0.42 / 1.414 ≈ 0.297
scores[0, 1] = Q[0] · K[1] / √2 = (0.5×0.3 + 0.3×0.9) / √2 = 0.42 / 1.414 ≈ 0.297
scores[0, 2] = Q[0] · K[2] / √2 = (0.5×0.5 + 0.3×0.2) / √2 = 0.31 / 1.414 ≈ 0.219
Softmax:
exp(0.297) ≈ 1.346
exp(0.297) ≈ 1.346
exp(0.219) ≈ 1.245
Sum ≈ 3.937
attn_weights[0, :] = [1.346/3.937, 1.346/3.937, 1.245/3.937]
≈ [0.342, 0.342, 0.316]
These sum to 1.0 (normalized).
Output:
attn_output[0] = 0.342 × V[0] + 0.342 × V[1] + 0.316 × V[2]
= 0.342 × [1, 0] + 0.342 × [0, 1] + 0.316 × [0.5, 0.5]
= [0.342, 0] + [0, 0.342] + [0.158, 0.158]
= [0.5, 0.5]
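Key insight: The output for the image patch (token 0) is a blend of all three value vectors, weighted by how similar its query is to each key. The weights (0.342, 0.342, 0.316) show that it attends equally to itself and to the first text word, and slightly less to the second text word. In this toy example the Q, K and V vectors are hand-picked; in a trained model they come from the learned projections W_Q, W_K and W_V.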
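The numbers in this worked example can be reproduced with a small single-head implementation. This is a sketch in plain Python; the function names are choices made for this example.

```python
import math


def softmax(scores):
    shift = max(scores)  # numerical-stability shift; does not change the result
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention; returns (weights, output) for each query."""
    d_head = len(Q[0])
    results = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_head) for k in K]
        weights = softmax(scores)
        output = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        results.append((weights, output))
    return results


Q = [[0.5, 0.3], [0.2, 0.8], [0.7, 0.1]]
K = [[0.6, 0.4], [0.3, 0.9], [0.5, 0.2]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights0, output0 = attention(Q, K, V)[0]
print([round(w, 3) for w in weights0])  # [0.342, 0.342, 0.316]
print([round(o, 3) for o in output0])   # [0.5, 0.5]
```

The output is exactly [0.5, 0.5] here because the first two weights are equal and V[2] is the midpoint of V[0] and V[1].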
Part 6: Multi-Head Attention
In practice, Gemini uses multiple “heads” (typically 32 or 64) that attend independently:
For each head h:
attn_h = Attention(Q_h, K_h, V_h)
Multi-head output = Concat(attn_0, attn_1, ..., attn_31) @ W_o
Where W_o ∈ ℝ^(d_model × d_model) recombines the heads.
This allows different heads to focus on different types of relationships:
- Some heads might focus on spatial structure (for image patches)
- Others on semantic meaning (for text tokens)
- Others on cross-modal alignments
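Multi-head attention can be sketched by running several small heads and recombining them. The example below is an illustration under toy assumptions: 2 heads, d_model = 4, random stand-in weights, and the column-vector convention (W @ x) used in the projection formulas earlier, so the recombination is written W_o @ concat. All helper names are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Return W @ x for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def head_attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    d_head = len(Q[0])
    outputs = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_head) for k in K]
        weights = softmax(scores)
        outputs.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(d_head)])
    return outputs


def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_Q, W_K, W_V) triples, one triple per head."""
    per_head = []
    for W_Q, W_K, W_V in heads:
        Q = [matvec(W_Q, x) for x in X]
        K = [matvec(W_K, x) for x in X]
        V = [matvec(W_V, x) for x in X]
        per_head.append(head_attention(Q, K, V))
    outputs = []
    for i in range(len(X)):
        concat = [value for head in per_head for value in head[i]]  # join head outputs
        outputs.append(matvec(W_o, concat))                         # recombine with W_o
    return outputs


def random_matrix(rows, cols):
    return [[random.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_model, num_heads = 4, 2
d_head = d_model // num_heads
heads = [tuple(random_matrix(d_head, d_model) for _ in range(3)) for _ in range(num_heads)]
W_o = random_matrix(d_model, d_model)
X = random_matrix(3, d_model)  # 3 tokens, each d_model-dimensional

out = multi_head_attention(X, heads, W_o)
print(len(out), len(out[0]))  # 3 4
```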
Part 7: Feed-Forward Network
After attention, each token passes through a dense network:
FFN(x) = W_2 @ ReLU(W_1 @ x + b_1) + b_2
W_1 ∈ ℝ^(d_ff × d_model) (typically d_ff = 4 × d_model)
W_2 ∈ ℝ^(d_model × d_ff)
For d_model = 2048:
W_1 ∈ ℝ^(8192 × 2048)
W_2 ∈ ℝ^(2048 × 8192)
The same network is applied to each token independently, which lets the model learn non-linear transformations of each token's representation.
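A minimal sketch of the feed-forward formula, using hand-picked toy weights (d_model = 2, d_ff = 4) chosen so that the result is easy to verify by hand: with these particular weights the network computes the element-wise absolute value. The parameter count at the end is derived from the shapes stated above.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def linear(W, b, x):
    """Compute W @ x + b for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def ffn(x, W_1, b_1, W_2, b_2):
    """FFN(x) = W_2 @ ReLU(W_1 @ x + b_1) + b_2"""
    return linear(W_2, b_2, relu(linear(W_1, b_1, x)))


# Hand-picked toy weights: the hidden layer holds x and -x, ReLU keeps the
# positive parts, and W_2 adds them back together, giving |x| per element.
W_1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_1 = [0.0, 0.0, 0.0, 0.0]
W_2 = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
b_2 = [0.0, 0.0]

print(ffn([0.5, -2.0], W_1, b_1, W_2, b_2))  # [0.5, 2.0]

# Parameter count for the real sizes quoted above.
d_model, d_ff = 2048, 8192
ffn_params = 2 * d_model * d_ff + d_ff + d_model
print(ffn_params)  # 33564672
```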
Summary: The Full Pipeline
Image 224×224                     Text "The cat sat on the mat"
      ↓                                      ↓
 256 patches                             6 tokens
(14×14 pixels each)                   (SentencePiece)
      ↓                                      ↓
      └─→ Project to d_model=2048 ←─┘
                    ↓
        262 embeddings (256 + 6)
                    ↓
        Add positional encoding
                    ↓
        Transformer Layer N times:
        ├─ Multi-head attention (image ↔ image, image ↔ text, text ↔ text)
        └─ Feed-forward network
                    ↓
        Output embeddings
                    ↓
        Task head (language, image generation, etc.)
                    ↓
        Logits → Softmax → Predictions
The beauty: No special handling for different modalities. The Transformer treats image patches and text tokens identically, allowing natural cross-modal reasoning.