The Math: Tokenisation, Embeddings, and Attention

Prerequisites: Matrix Multiplication and Projections, Softmax and Attention

Part 1: Image Tokenisation

Step 1: Patch Division

An image I ∈ ℝ^(H × W × C) (height × width × channels) is divided into non-overlapping patches.

For a standard 224×224 RGB image with patch size 14×14:

Number of patches = (H / patch_size) × (W / patch_size)
                  = (224 / 14) × (224 / 14)
                  = 16 × 16
                  = 256 patches

Each patch P_ij is a 14×14×3 = 588-dimensional vector (RGB values flattened).
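
The patch arithmetic above can be checked with a short sketch. The helper name `patchify` and the all-zero dummy image are illustrative assumptions, and plain nested lists are used only so that the example runs without dependencies; a real pipeline would use an array library.

```python
def patchify(image, patch_size):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches.

    Patches are returned in row-major order (left to right, top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # append this pixel's channel values
            patches.append(flat)
    return patches


# A dummy 224 x 224 RGB image (all zeros) stands in for real pixel data.
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
patches = patchify(image, 14)
print(len(patches), len(patches[0]))  # 256 588
```

Each entry of `patches` is one flattened 588-value vector, ready for the projection in Step 2.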

Step 2: Patch Embedding

Each patch is projected to d_model dimensions using a learned linear layer:

e_patch = W_patch @ flatten(P_ij) + b_patch

Where:

W_patch ∈ ℝ^(d_model × 588) is the patch projection weight matrix
b_patch ∈ ℝ^d_model is the learned bias vector
e_patch ∈ ℝ^d_model is the embedded patch

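This walkthrough uses d_model = 2048 as an illustrative value for Gemini Ultra; treat it as an assumption rather than a published figure.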

Worked Numerical Example

Setup:

Image size: 224×224 (standard)
Patch size: 14×14
d_model: 2048 (assumed value for Gemini Ultra)

Calculation:

Number of patches:

patches = (224 / 14) × (224 / 14) = 16 × 16 = 256

Size of one patch vector:

patch_vector_size = 14 × 14 × 3 = 196 × 3 = 588

Embedding matrix dimensions:

W_patch ∈ ℝ^(2048 × 588)  (projects 588-D patch → 2048-D embedding)

Example patch embedding (concrete numbers):

One 14×14 pixel patch in RGB:
P = [
  [255, 128, 64],      # top-left pixel (reddish)
  [254, 127, 63],      # next pixel
  ... 194 more pixels ...
]

Flattened: [255, 128, 64, 254, 127, 63, ..., 200, 100, 50] (588 values)

e_patch = W_patch @ patch + b_patch
        ∈ ℝ^2048

After processing all 256 patches, we have a sequence of 256 embeddings, each 2048-dimensional.
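
As a rough sketch of the projection for a single patch, the following uses the real shapes from the example. The weights are random stand-ins for learned parameters, the helper name `linear` is illustrative, and scaling pixel values to [0, 1] is an assumption that the text above does not specify.

```python
import random


def linear(weights, vector, bias):
    """Compute weights @ vector + bias for a row-major list-of-lists matrix."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


rng = random.Random(0)
patch_dim, d_model = 14 * 14 * 3, 2048  # 588 inputs, 2048 outputs

# Random stand-ins for the learned parameters W_patch and b_patch.
W_patch = [[rng.gauss(0.0, 0.02) for _ in range(patch_dim)] for _ in range(d_model)]
b_patch = [0.0] * d_model

# One flattened patch; scaling 0..255 to 0..1 is an assumption of this sketch.
patch = [rng.randrange(256) / 255 for _ in range(patch_dim)]

e_patch = linear(W_patch, patch, b_patch)
print(len(W_patch), len(W_patch[0]), len(e_patch))  # 2048 588 2048
```

Repeating the call for all 256 patches yields the 256 × 2048 sequence described above.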

Part 2: Text Tokenisation

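Text is tokenised with SentencePiece, a subword tokeniser with a 256K-entry vocabulary.

For example (shown as whole words for simplicity; a real SentencePiece model may split a word into several pieces):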

Text: "The cat sat on the mat"
Tokens: [The, cat, sat, on, the, mat]  (6 tokens)

Embeddings: 
  e_text[0] = W_text[token_id("The")] ∈ ℝ^2048
  e_text[1] = W_text[token_id("cat")] ∈ ℝ^2048
  ...

Where W_text ∈ ℝ^(256000 × 2048) is the token embedding matrix (256K vocabulary × 2048 dimensions). Embedding a token is a row lookup, not a matrix multiplication.
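
The lookup can be sketched with a toy table. The six-entry vocabulary, its ids, and the whitespace split are hypothetical stand-ins for a trained SentencePiece model, and d_model is shrunk to 8 to keep the table small.

```python
import random

rng = random.Random(0)
d_model = 8  # toy width; the text uses 2048

# Hypothetical toy vocabulary; a real SentencePiece model defines its own ids.
vocab = {"The": 0, "cat": 1, "sat": 2, "on": 3, "the": 4, "mat": 5}

# One row per token id, standing in for the learned matrix W_text.
W_text = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in vocab]

# Whitespace splitting stands in for subword tokenisation.
tokens = "The cat sat on the mat".split()
token_ids = [vocab[t] for t in tokens]

# Embedding is a row lookup: e_text[i] is row token_ids[i] of W_text.
e_text = [W_text[i] for i in token_ids]

print(token_ids)                    # [0, 1, 2, 3, 4, 5]
print(len(e_text), len(e_text[0]))  # 6 8
```

Note that "The" and "the" receive different ids here, as they would in any case-sensitive vocabulary.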

Part 3: Token Concatenation

Now we have:

256 image patch embeddings (each ∈ ℝ^2048)
6 text token embeddings (each ∈ ℝ^2048)

These are concatenated into one sequence:

X = [e_patch[0], e_patch[1], ..., e_patch[255], 
     e_text[0], e_text[1], ..., e_text[5]]

X ∈ ℝ^(262 × 2048)  (262 tokens total, each 2048-D)

The Transformer doesn’t care that tokens 0–255 came from an image and 256–261 came from text. They’re all just tokens.
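
A minimal sketch of the concatenation, using placeholder embeddings and a toy width of 4. The `modality` list is bookkeeping added for illustration only; nothing in the sequence X itself records where a token came from.

```python
d_model = 4  # toy width; the text uses 2048

# Placeholder embeddings: 256 image patches followed by 6 text tokens.
e_patch = [[0.0] * d_model for _ in range(256)]
e_text = [[1.0] * d_model for _ in range(6)]

# Concatenate along the sequence axis: image tokens first, then text tokens.
X = e_patch + e_text
modality = ["image"] * len(e_patch) + ["text"] * len(e_text)

print(len(X), len(X[0]))             # 262 4
print(modality[255], modality[256])  # image text
```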

Part 4: Positional Encoding

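The Transformer needs to know the order of tokens, because attention by itself is order-agnostic. This walkthrough uses the sinusoidal absolute positional encoding of the original Transformer as a stand-in; Gemini's exact positional scheme is not publicly documented. Dimensions 2k and 2k+1 share one frequency, with sin on the even index and cos on the odd index:

pos_enc[i] = [
  sin(i / 10000^(0/d_model)),
  cos(i / 10000^(0/d_model)),
  sin(i / 10000^(2/d_model)),
  cos(i / 10000^(2/d_model)),
  ...
]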

pos_enc[i] ∈ ℝ^d_model

Each token’s embedding is added to its positional encoding:

X_pos[i] = X[i] + pos_enc[i]

Worked Example (Smaller Dimension)

Let d_model = 4 (simplified; real: 2048), so the two divisors are 10000^(0/4) = 1 and 10000^(2/4) = 100. Compute the positional encoding for positions i=0 and i=1:

Position 0:

pos_enc[0] = [sin(0/1), cos(0/1), sin(0/100), cos(0/100)]
           = [sin(0), cos(0), sin(0), cos(0)]
           = [0, 1, 0, 1]

Position 1:

pos_enc[1] = [sin(1/1), cos(1/1), sin(1/100), cos(1/100)]
           = [sin(1), cos(1), sin(0.01), cos(0.01)]
           ≈ [0.841, 0.540, 0.010, 1.000]  (rounded to three decimals)

So if token 0 (image patch) had embedding [0.5, 0.2, 0.3, 0.1], it becomes:

[0.5, 0.2, 0.3, 0.1] + [0, 1, 0, 1] = [0.5, 1.2, 0.3, 1.1]
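
The worked example can be reproduced with a few lines of standard-library Python. This is a sketch of the sinusoidal scheme only; the function name `positional_encoding` is illustrative and the code assumes an even d_model.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal encoding: dims 2k and 2k+1 share the divisor 10000^(2k/d_model)."""
    encoding = []
    for k in range(d_model // 2):
        angle = position / (10000 ** (2 * k / d_model))
        encoding.append(math.sin(angle))  # even dimension 2k
        encoding.append(math.cos(angle))  # odd dimension 2k + 1
    return encoding


pos0 = positional_encoding(0, 4)
pos1 = positional_encoding(1, 4)
print([round(v, 3) for v in pos0])  # [0.0, 1.0, 0.0, 1.0]
print([round(v, 3) for v in pos1])  # [0.841, 0.54, 0.01, 1.0]

# Add the encoding for position 0 to the example embedding.
embedding = [0.5, 0.2, 0.3, 0.1]
print([round(e + p, 3) for e, p in zip(embedding, pos0)])  # [0.5, 1.2, 0.3, 1.1]
```

The last component of position 1 is cos(0.01) ≈ 0.99995, which rounds to 1.0 at three decimals.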

Part 5: Attention Mechanism

The core of the Transformer. For a sequence of length n (262 in our example):

Query, Key, Value Projections

Each token embedding X_pos[i] is projected into three spaces:

Q[i] = W_Q @ X_pos[i]           (Query, ∈ ℝ^d_head)
K[i] = W_K @ X_pos[i]           (Key, ∈ ℝ^d_head)
V[i] = W_V @ X_pos[i]           (Value, ∈ ℝ^d_head)

Where W_Q, W_K, W_V ∈ ℝ^(d_head × d_model)
Typically d_head = d_model / num_heads, e.g., 2048 / 32 = 64

Attention Weights

For each position i, compute its similarity to every position j, including itself:

scores[i, j] = Q[i] · K[j] / √d_head    (scaled dot-product similarity)

For i=100 (an image patch), it attends to:
  j=0-255 (image patches, including itself)
  j=256-261 (text tokens)

The math treats all of them the same way.

Softmax Normalization

attn_weights[i, j] = exp(scores[i, j]) / Σ_k exp(scores[i, k])

For our example with 262 tokens:
  Denominator = Σ_{k=0}^{261} exp(scores[i, k])
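
One practical detail the formula hides is that exp can overflow for large scores. A common remedy, sketched here as an assumption about implementation rather than anything stated above, is to subtract the row maximum first. That leaves the result unchanged, because the common factor cancels between numerator and denominator.

```python
import math


def softmax(scores):
    """Softmax with the maximum subtracted first to avoid overflow in exp."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


small = softmax([2.0, 1.0, 0.0])
large = softmax([1002.0, 1001.0, 1000.0])  # math.exp(1002.0) alone would overflow

print([round(w, 3) for w in small])  # [0.665, 0.245, 0.09]
print([round(w, 3) for w in large])  # [0.665, 0.245, 0.09]
```

Both rows give the same weights because softmax depends only on the differences between scores.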

Weighted Sum

The output for position i is the weighted sum of all values:

attn_output[i] = Σ_j attn_weights[i, j] × V[j]

Worked Attention Example

Simplified: 3 tokens (image patch + 2 text words), d_head = 2.

Q[0] = [0.5, 0.3]    (image patch query)
Q[1] = [0.2, 0.8]    (text word "cat" query)
Q[2] = [0.7, 0.1]    (text word "sat" query)

K[0] = [0.6, 0.4]
K[1] = [0.3, 0.9]
K[2] = [0.5, 0.2]

V[0] = [1.0, 0.0]
V[1] = [0.0, 1.0]
V[2] = [0.5, 0.5]

Compute attention for token 0 (image patch):

scores[0, 0] = Q[0] · K[0] / √2 = (0.5×0.6 + 0.3×0.4) / √2 = 0.42 / 1.414 ≈ 0.297
scores[0, 1] = Q[0] · K[1] / √2 = (0.5×0.3 + 0.3×0.9) / √2 = 0.42 / 1.414 ≈ 0.297
scores[0, 2] = Q[0] · K[2] / √2 = (0.5×0.5 + 0.3×0.2) / √2 = 0.31 / 1.414 ≈ 0.219

Softmax:

exp(0.297) ≈ 1.346
exp(0.297) ≈ 1.346
exp(0.219) ≈ 1.245

Sum = 3.937

attn_weights[0, :] = [1.346/3.937, 1.346/3.937, 1.245/3.937]
                   ≈ [0.342, 0.342, 0.316]

These sum to 1.0 (normalized).

Output:

attn_output[0] = 0.342 × V[0] + 0.342 × V[1] + 0.316 × V[2]
               = 0.342 × [1, 0] + 0.342 × [0, 1] + 0.316 × [0.5, 0.5]
               = [0.342, 0] + [0, 0.342] + [0.158, 0.158]
               = [0.5, 0.5]

Key insight: The output for the image patch (token 0) blends the value vectors of all three tokens in proportion to how well its query matches their keys. The weights (0.342, 0.342, 0.316) show that it attends equally to itself and to the first text word, and slightly less to the second text word. In a trained model, these queries and keys come from the learned projections W_Q and W_K.
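
The numbers in this worked example can be reproduced with a small sketch of single-head scaled dot-product attention. The function name `attend` is illustrative; Q, K, and V are the values given above.

```python
import math


def softmax(scores):
    """Softmax with the maximum subtracted first for numerical stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d_head = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_head)
        for key in keys
    ]
    weights = softmax(scores)
    output = [
        sum(w * value[d] for w, value in zip(weights, values))
        for d in range(len(values[0]))
    ]
    return scores, weights, output


Q = [[0.5, 0.3], [0.2, 0.8], [0.7, 0.1]]
K = [[0.6, 0.4], [0.3, 0.9], [0.5, 0.2]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

scores, weights, output = attend(Q[0], K, V)
print([round(s, 3) for s in scores])   # [0.297, 0.297, 0.219]
print([round(w, 3) for w in weights])  # [0.342, 0.342, 0.316]
print([round(o, 3) for o in output])   # [0.5, 0.5]
```

Calling `attend(Q[1], K, V)` or `attend(Q[2], K, V)` gives the rows for the two text tokens.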

Part 6: Multi-Head Attention

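In practice, the model runs several attention “heads” in parallel (32 in this walkthrough, matching d_head = 2048 / 32 = 64; Gemini's actual head count is an assumption here). Each head has its own W_Q, W_K, and W_V and attends independently: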

For each head h:
  attn_h = Attention(Q_h, K_h, V_h)

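Multi-head output = W_o @ Concat(attn_0, attn_1, ..., attn_31)

Where W_o ∈ ℝ^(d_model × d_model) recombines the heads, and Concat stacks the 32 head outputs (each ∈ ℝ^d_head) into one d_model-dimensional vector per token.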

This allows different heads to focus on different types of relationships:

Some heads might focus on spatial structure (for image patches)
Others on semantic meaning (for text tokens)
Others on cross-modal alignments
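
A toy sketch of the multi-head computation, with d_model = 8 and 2 heads instead of 2048 and 32. All weights are random stand-ins and the helper names are illustrative. The output projection is applied as W_o @ concat, which matches the W @ x convention of the projection formulas earlier; with a square W_o this differs from Concat @ W_o only by a transpose of the learned matrix.

```python
import math
import random


def matvec(matrix, vector):
    """Compute matrix @ vector for a row-major list-of-lists matrix."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(X, heads, W_o):
    """X: n token vectors; heads: one (W_Q, W_K, W_V) triple per head."""
    per_head = []
    for W_Q, W_K, W_V in heads:
        Q = [matvec(W_Q, x) for x in X]
        K = [matvec(W_K, x) for x in X]
        V = [matvec(W_V, x) for x in X]
        d_head = len(Q[0])
        head_out = []
        for q in Q:
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_head) for k in K]
            weights = softmax(scores)
            head_out.append(
                [sum(w * v[d] for w, v in zip(weights, V)) for d in range(d_head)]
            )
        per_head.append(head_out)
    outputs = []
    for i in range(len(X)):
        # Concatenate every head's output for token i, then mix with W_o.
        concat = [value for head_out in per_head for value in head_out[i]]
        outputs.append(matvec(W_o, concat))
    return outputs


rng = random.Random(0)
n_tokens, d_model, num_heads = 3, 8, 2
d_head = d_model // num_heads

X = random_matrix(rng, n_tokens, d_model)
heads = [
    tuple(random_matrix(rng, d_head, d_model) for _ in range(3))
    for _ in range(num_heads)
]
W_o = random_matrix(rng, d_model, d_model)

out = multi_head_attention(X, heads, W_o)
print(len(out), len(out[0]))  # 3 8
```

The output has the same shape as the input, which is what allows Transformer layers to be stacked.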

Part 7: Feed-Forward Network

After attention, each token passes through a dense network:

FFN(x) = W_2 @ ReLU(W_1 @ x + b_1) + b_2

W_1 ∈ ℝ^(d_ff × d_model)   (typically d_ff = 4 × d_model)
W_2 ∈ ℝ^(d_model × d_ff)

For d_model = 2048:
  W_1 ∈ ℝ^(8192 × 2048)
  W_2 ∈ ℝ^(2048 × 8192)

This is applied to each token independently, with the same weights at every position. The ReLU between the two linear layers is what makes the transformation non-linear.
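
A small sketch of the feed-forward block with hand-picked weights (d_model = 2, d_ff = 4). The weights are chosen for illustration so that the network computes the element-wise absolute value, something no single linear layer can do. The last lines count the parameters for the 2048/8192 sizes given above.

```python
def feed_forward(x, W_1, b_1, W_2, b_2):
    """FFN(x) = W_2 @ ReLU(W_1 @ x + b_1) + b_2 for one token vector x."""
    hidden = [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)  # ReLU
        for row, b in zip(W_1, b_1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W_2, b_2)]


# Hand-picked toy weights: hidden units hold ReLU(x) and ReLU(-x).
W_1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b_1 = [0.0, 0.0, 0.0, 0.0]
W_2 = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
b_2 = [0.0, 0.0]

print(feed_forward([0.5, -2.0], W_1, b_1, W_2, b_2))  # [0.5, 2.0]

# Parameter count for the sizes in the text (weights plus biases).
d_model, d_ff = 2048, 4 * 2048
ffn_params = d_ff * d_model + d_ff + d_model * d_ff + d_model
print(ffn_params)  # 33564672
```

At these sizes, one feed-forward block holds roughly 33.6 million parameters.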

Summary: The Full Pipeline

Image 224×224          Text "The cat sat on the mat"
     ↓                       ↓
256 patches            6 tokens
(14×14 pixels each)    (SentencePiece)
     ↓                       ↓
     └─→ Project to d_model=2048 ←─┘
           ↓
     262 embeddings (256 + 6)
           ↓
     Add positional encoding
           ↓
     Transformer Layer N times:
       ├─ Multi-head attention (image ↔ image, image ↔ text, text ↔ text)
       └─ Feed-forward network
           ↓
     Output embeddings
           ↓
     Task head (language, image generation, etc.)
           ↓
     Logits → Softmax → Predictions

The beauty: Once each modality has been embedded into d_model dimensions, no special handling is needed. The Transformer layers treat image patches and text tokens identically, so cross-modal reasoning comes from the same attention mechanism as everything else.

Next: Worked Example: End-to-End Tokenisation