Worked Example: Tokenising an Image and Text Together
Let’s trace through the full process of converting a simple image and caption into tokens that Gemini can process.
Scenario
We have:
- Image: A 28×28 pixel RGB photo (small, for easy calculation)
- Caption: “A cat”
- Patch size: 7×7 (to keep numbers manageable)
- d_model: 8 (simplified; real models use far larger values, e.g. 2048)
Step 1: Divide Image into Patches
A 28×28 image with 7×7 patches:
Number of patches = (28 / 7) × (28 / 7) = 4 × 4 = 16 patches
Imagine the image divided into a 4×4 grid:
┌─────────┬─────────┬─────────┬─────────┐
│ Patch 0 │ Patch 1 │ Patch 2 │ Patch 3 │
├─────────┼─────────┼─────────┼─────────┤
│ Patch 4 │ Patch 5 │ Patch 6 │ Patch 7 │
├─────────┼─────────┼─────────┼─────────┤
│ Patch 8 │ Patch 9 │ Patch10 │ Patch11 │
├─────────┼─────────┼─────────┼─────────┤
│Patch 12 │Patch 13 │Patch 14 │Patch 15 │
└─────────┴─────────┴─────────┴─────────┘
Each patch is 7×7×3 = 147 values.
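The patch split can be sketched in a few lines of plain Python. This is a minimal illustration rather than any production preprocessing pipeline: the helper name `split_into_patches` and the all-white synthetic image are assumptions made for the example.

```python
def split_into_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches.

    Hypothetical helper for illustration; patches are returned in
    row-major order so indices match the 4x4 grid above.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch: rows, then pixels, then colour channels.
            patch = [
                channel
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for channel in pixel
            ]
            patches.append(patch)
    return patches


# Synthetic 28x28 RGB image in which every pixel is white.
image = [[[255, 255, 255] for _ in range(28)] for _ in range(28)]
patches = split_into_patches(image, patch_size=7)
print(len(patches), len(patches[0]))  # 16 147
```

Flattening in row, pixel, channel order is one possible convention; any fixed order works as long as it is the same for every patch.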
Step 2: Embed Each Patch
Each patch is a 147-dimensional vector. We project it to d_model = 8:
W_patch ∈ ℝ^(8 × 147)
For Patch 0 (top-left, contains mostly white sky):
Raw patch values: [255, 255, 255, ... 147 times] (white pixels)
e_patch[0] = W_patch @ [255, 255, ..., 255] + b_patch
≈ [0.5, -0.2, 0.8, 0.1, -0.3, 0.6, 0.2, 0.4] (example embedding)
∈ ℝ^8
For Patch 1 (top row, second column; contains the edge of the cat's head):
Raw patch: [200, 150, 100, ... varied pixel values ...]
e_patch[1] ≈ [0.2, 0.9, -0.1, 0.7, 0.3, -0.2, 0.5, 0.1]
∈ ℝ^8
We get 16 embeddings, each 8-dimensional:
e_patch = [[0.5, -0.2, 0.8, 0.1, -0.3, 0.6, 0.2, 0.4], # Patch 0
[0.2, 0.9, -0.1, 0.7, 0.3, -0.2, 0.5, 0.1], # Patch 1
...
[0.1, 0.3, 0.6, -0.4, 0.2, 0.7, -0.1, 0.5]] # Patch 15
Shape: (16, 8) [16 patches × 8-D embeddings]
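The projection from 147 raw values to 8 dimensions is an ordinary linear layer. The sketch below uses randomly initialised weights as stand-ins for `W_patch` and `b_patch`, so the numbers it produces will not match the example embeddings above; a trained model would have learned these parameters. The function name `embed_patch` is hypothetical.

```python
import random

random.seed(0)  # reproducible toy weights

PATCH_DIM, D_MODEL = 147, 8

# Randomly initialised stand-ins for the learned W_patch (8 x 147) and b_patch (8).
W_patch = [[random.gauss(0.0, 0.02) for _ in range(PATCH_DIM)] for _ in range(D_MODEL)]
b_patch = [0.0] * D_MODEL


def embed_patch(patch):
    """Compute W_patch @ patch + b_patch for one flattened patch."""
    return [
        sum(w * value for w, value in zip(weight_row, patch)) + bias
        for weight_row, bias in zip(W_patch, b_patch)
    ]


# 16 all-white placeholder patches, each with 147 values.
patches = [[255] * PATCH_DIM for _ in range(16)]
e_patch = [embed_patch(p) for p in patches]
print(len(e_patch), len(e_patch[0]))  # 16 8
```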
Step 3: Tokenise Text
“A cat” is tokenised using SentencePiece:
Text: "A cat"
Tokens: ["A", "cat"] (2 tokens)
Token IDs: [15, 234] (hypothetical IDs from a 256K-entry vocabulary)
Word embeddings (from W_text ∈ ℝ^(256000 × 8)):
e_text[0] = W_text[15] ≈ [0.3, 0.1, -0.2, 0.5, 0.6, -0.1, 0.2, 0.8]
e_text[1] = W_text[234] ≈ [0.7, -0.3, 0.4, 0.2, -0.5, 0.6, 0.1, 0.3]
Shape: (2, 8) [2 tokens × 8-D embeddings]
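Text embedding is a table lookup: each token ID selects one row of `W_text`. The sketch below is a toy under stated assumptions: whitespace splitting stands in for SentencePiece, the IDs 15 and 234 are the hypothetical ones from the example, and the table is shrunk to 256 rows so it stays small.

```python
import random

random.seed(0)  # reproducible toy weights

D_MODEL = 8
VOCAB_SIZE = 256  # shrunk from 256K for illustration

# Hypothetical token IDs from the example; a real tokeniser assigns its own.
toy_vocab = {"A": 15, "cat": 234}

# Randomly initialised stand-in for the learned embedding table W_text.
W_text = [[random.gauss(0.0, 0.02) for _ in range(D_MODEL)] for _ in range(VOCAB_SIZE)]

tokens = "A cat".split()  # naive split, not real SentencePiece
token_ids = [toy_vocab[t] for t in tokens]
e_text = [W_text[i] for i in token_ids]  # embedding lookup = row selection

print(token_ids)                    # [15, 234]
print(len(e_text), len(e_text[0]))  # 2 8
```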
Step 4: Concatenate All Tokens
Combine image patches and text tokens:
X = [
[0.5, -0.2, 0.8, 0.1, -0.3, 0.6, 0.2, 0.4], # Patch 0
[0.2, 0.9, -0.1, 0.7, 0.3, -0.2, 0.5, 0.1], # Patch 1
... (patches 2-15) ...
[0.3, 0.1, -0.2, 0.5, 0.6, -0.1, 0.2, 0.8], # Text token "A"
[0.7, -0.3, 0.4, 0.2, -0.5, 0.6, 0.1, 0.3] # Text token "cat"
]
Shape: (18, 8) [18 total tokens × 8-D embeddings]
Key point: Tokens 0–15 came from an image. Tokens 16–17 came from text. The model treats them identically.
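Concatenation is plain sequence joining, with image tokens first. In the sketch below the 16 patch embeddings are zero-filled placeholders (the example elides patches 2 to 15), and the `modality` list is a hypothetical bookkeeping aid for inspection only; nothing of the kind is passed to the model.

```python
D_MODEL = 8

# Zero-filled placeholders standing in for the 16 patch embeddings.
e_patch = [[0.0] * D_MODEL for _ in range(16)]

# The two text embeddings from Step 3.
e_text = [
    [0.3, 0.1, -0.2, 0.5, 0.6, -0.1, 0.2, 0.8],  # "A"
    [0.7, -0.3, 0.4, 0.2, -0.5, 0.6, 0.1, 0.3],  # "cat"
]

# Image tokens occupy positions 0-15, text tokens positions 16-17.
X = e_patch + e_text
modality = ["image"] * len(e_patch) + ["text"] * len(e_text)

print(len(X), len(X[0]))       # 18 8
print(modality.index("text"))  # 16
```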
Step 5: Add Positional Encodings
Compute the positional encoding for each position i using the standard sinusoidal formula, in which each sine/cosine pair shares one frequency:
pos_enc[i, 2k] = sin(i / 10000^(2k/d_model))
pos_enc[i, 2k+1] = cos(i / 10000^(2k/d_model))
For d_model = 8, k runs over 0, 1, 2, 3, so the divisors 10000^(2k/8) are 1, 10, 100, and 1000. Compute the encoding for positions 0, 1, and 16:
Position 0 (Patch 0)
pos_enc[0, 0] = sin(0 / 1) = sin(0) = 0
pos_enc[0, 1] = cos(0 / 1) = cos(0) = 1
pos_enc[0, 2] = sin(0 / 10) = sin(0) = 0
pos_enc[0, 3] = cos(0 / 10) = cos(0) = 1
pos_enc[0, 4] = sin(0 / 100) = sin(0) = 0
pos_enc[0, 5] = cos(0 / 100) = cos(0) = 1
pos_enc[0, 6] = sin(0 / 1000) = sin(0) = 0
pos_enc[0, 7] = cos(0 / 1000) = cos(0) = 1
pos_enc[0] = [0, 1, 0, 1, 0, 1, 0, 1]
Position 1 (Patch 1)
pos_enc[1, 0] = sin(1 / 1) = sin(1) ≈ 0.841
pos_enc[1, 1] = cos(1 / 1) = cos(1) ≈ 0.540
pos_enc[1, 2] = sin(1 / 10) = sin(0.1) ≈ 0.100
pos_enc[1, 3] = cos(1 / 10) = cos(0.1) ≈ 0.995
pos_enc[1, 4] = sin(1 / 100) = sin(0.01) ≈ 0.010
pos_enc[1, 5] = cos(1 / 100) = cos(0.01) ≈ 1.000
pos_enc[1, 6] = sin(1 / 1000) = sin(0.001) ≈ 0.001
pos_enc[1, 7] = cos(1 / 1000) = cos(0.001) ≈ 1.000
pos_enc[1] ≈ [0.841, 0.540, 0.100, 0.995, 0.010, 1.000, 0.001, 1.000]
Position 16 (Text Token “A”, after all image patches)
pos_enc[16, 0] = sin(16 / 1) = sin(16) ≈ -0.288
pos_enc[16, 1] = cos(16 / 1) = cos(16) ≈ -0.958
pos_enc[16, 2] = sin(16 / 10) = sin(1.6) ≈ 1.000
pos_enc[16, 3] = cos(16 / 10) = cos(1.6) ≈ -0.029
pos_enc[16, 4] = sin(16 / 100) = sin(0.16) ≈ 0.159
pos_enc[16, 5] = cos(16 / 100) = cos(0.16) ≈ 0.987
pos_enc[16, 6] = sin(16 / 1000) = sin(0.016) ≈ 0.016
pos_enc[16, 7] = cos(16 / 1000) = cos(0.016) ≈ 1.000
pos_enc[16] ≈ [-0.288, -0.958, 1.000, -0.029, 0.159, 0.987, 0.016, 1.000]
Key insight: Positions far apart (0 vs 16) receive clearly different positional encodings, so the Transformer can tell that the text token comes after all the image patches.
Step 6: Add Positional Encodings to Token Embeddings
X_pos = X + pos_enc
X_pos[0] = [0.5, -0.2, 0.8, 0.1, -0.3, 0.6, 0.2, 0.4]
+ [0, 1, 0, 1, 0, 1, 0, 1]
= [0.5, 0.8, 0.8, 1.1, -0.3, 1.6, 0.2, 1.4]
X_pos[1] = [0.2, 0.9, -0.1, 0.7, 0.3, -0.2, 0.5, 0.1]
+ [0.841, 0.540, 0.100, 0.995, 0.010, 1.000, 0.001, 1.000]
= [1.041, 1.440, 0.000, 1.695, 0.310, 0.800, 0.501, 1.100]
X_pos[16] = [0.3, 0.1, -0.2, 0.5, 0.6, -0.1, 0.2, 0.8]
+ [-0.288, -0.958, 1.000, -0.029, 0.159, 0.987, 0.016, 1.000]
= [0.012, -0.858, 0.800, 0.471, 0.759, 0.887, 0.216, 1.800]
Final shape: (18, 8)
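Steps 5 and 6 can be sketched together. The code assumes the standard sinusoidal scheme, in which the index pair (2k, 2k+1) shares the divisor 10000^(2k/d_model); whether a given production model uses this exact scheme is not something the sketch claims. The function names `positional_encoding` and `add_positions` are hypothetical.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal encoding: each sin/cos pair shares one frequency."""
    encoding = []
    for k in range(d_model // 2):
        angle = position / (10000 ** (2 * k / d_model))
        encoding.append(math.sin(angle))  # even index 2k
        encoding.append(math.cos(angle))  # odd index 2k + 1
    return encoding


def add_positions(X):
    """Add the encoding for position i to row i of X, element-wise."""
    d_model = len(X[0])
    return [
        [x + p for x, p in zip(row, positional_encoding(i, d_model))]
        for i, row in enumerate(X)
    ]


# Patch 0's example embedding sits at position 0.
patch_0 = [0.5, -0.2, 0.8, 0.1, -0.3, 0.6, 0.2, 0.4]
X_pos = add_positions([patch_0])
print([round(v, 3) for v in X_pos[0]])
# [0.5, 0.8, 0.8, 1.1, -0.3, 1.6, 0.2, 1.4]
```

Because the encoding is added rather than concatenated, the sequence keeps its (18, 8) shape.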
Step 7: Feed Into Transformer
Now X_pos (18 tokens, 8 dimensions each) is fed into the Transformer stack:
X_pos (18 × 8)
↓
Multi-head attention (all 18 tokens attend to each other)
├─ Patches 0-15 attend to each other (spatial relationships within the image)
├─ Patches 0-15 attend to tokens 16-17 (image features conditioned on the text)
├─ Tokens 16-17 attend to patches 0-15 (language grounded in the image)
└─ Tokens 16-17 attend to each other (relationships within the text)
↓
Feed-forward network (applied per token)
↓
(Repeat N times)
↓
Output: (18, 8) embeddings ready for prediction
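To make the attention step concrete, here is a deliberately simplified single-head self-attention in plain Python. It is a sketch under strong assumptions: queries, keys, and values are the input itself (no learned projections), there is one head, no feed-forward layer, and the 18 input rows are synthetic placeholders rather than the embeddings computed above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Scaled dot-product attention with Q = K = V = X (a simplification)."""
    d_model = len(X[0])
    weights = []
    for query in X:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_model)
            for key in X
        ]
        weights.append(softmax(scores))
    # Each output row is a weighted average of all input rows.
    output = [
        [sum(w * value[j] for w, value in zip(row, X)) for j in range(d_model)]
        for row in weights
    ]
    return output, weights


# 18 synthetic placeholder tokens with 8 dimensions each.
X_pos = [[math.sin(i + j) for j in range(8)] for i in range(18)]
output, weights = self_attention(X_pos)
print(len(output), len(output[0]))    # 18 8
print(len(weights), len(weights[0]))  # 18 18
```

Row 16 of `weights` holds the attention that the first text token pays to every token, image patches included, which is where cross-modal mixing happens.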
What Did the Model Learn?
After training, the model’s attention weights might reveal patterns such as the following (illustrative, not measured):
When processing Patch 5 (likely contains part of the cat):
- High attention to patches 1, 2, 4, 6, 9, 10 (nearby patches, capturing spatial structure)
- High attention to token 17 “cat” (grounding the visual feature in language)
- Low attention to background patches
When processing token 17 “cat”:
- High attention to patches 1, 5, 9 (where the cat appears in the image)
- Low attention to patches that contain only background (sky, grass)
- Moderate attention to token 16 “A” (capturing grammar)
This cross-modal reasoning emerges automatically from the unified architecture.
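The "nearby patches" mentioned for Patch 5 follow directly from the 4×4 grid layout. A small hypothetical helper, `neighbouring_patches`, shows how a flat patch index maps back to grid coordinates; the patches listed in the example are a subset of the eight it returns.

```python
GRID = 4  # 28 / 7 = 4 patches per side


def neighbouring_patches(index, grid=GRID):
    """Return indices of patches adjacent (including diagonals) to `index`."""
    row, col = divmod(index, grid)
    neighbours = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            # Skip the patch itself and anything outside the grid.
            if (dr, dc) != (0, 0) and 0 <= r < grid and 0 <= c < grid:
                neighbours.append(r * grid + c)
    return neighbours


print(neighbouring_patches(5))  # [0, 1, 2, 4, 6, 8, 9, 10]
```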
Summary
| Step | Input | Output | Dimension |
|---|---|---|---|
| Patch division | 28×28×3 image | 16 patches of 7×7×3 | (16, 147) |
| Patch embedding | 16 patches (147-D) | 16 embeddings | (16, 8) |
| Text tokenisation | “A cat” | 2 tokens | (2,) |
| Text embedding | 2 tokens | 2 embeddings | (2, 8) |
| Concatenation | Patches + text | Combined sequence | (18, 8) |
| Positional encoding | (18, 8) + positions | Position-aware embeddings | (18, 8) |
| Transformer | (18, 8) | Processed representations | (18, 8) |
The beauty of native multimodality: once each modality has been embedded into d_model-dimensional tokens, every later step is identical regardless of modality. No special cases, no bolted-on components. Just tokens and attention.