The Idea: Native Multimodality from the Start

Core Insight

Instead of this:

Text Model (trained for years) + Vision Encoder (bolted on) + Projection = Multimodal

Gemini does this:

Single unified model trained jointly on text + images + audio + video from day one

The key idea in the paper: the models are trained jointly across image, audio, video, and text data. Not retrofitted. Not adapted. Built from scratch to think in multiple modalities.

The Architecture (High Level)

┌─────────────────────────────────────────┐
│ Input: Text | Image | Audio | Video     │
└────────┬────────────────────────────────┘
         │
         ↓
┌─────────────────────────────────────────┐
│ Unified Tokeniser                       │
│ (SentencePiece for text)                │
│ (Patch embeddings for images)           │
│ (Spectral features for audio)           │
│ (Frame + temporal for video)            │
└────────┬────────────────────────────────┘
         │
         ↓
┌─────────────────────────────────────────┐
│ Embedding & Projection to common space  │
│ All modalities → same d_model dimension │
└────────┬────────────────────────────────┘
         │
         ↓
┌─────────────────────────────────────────┐
│ Transformer Stack                       │
│ (Efficient attention, shared weights)   │
└────────┬────────────────────────────────┘
         │
         ↓
┌─────────────────────────────────────────┐
│ Task-Specific Output Heads              │
│ (Language, image generation, etc.)      │
└─────────────────────────────────────────┘

The Three Key Components

1. Unified Tokenisation

Text: Tokenised with SentencePiece (the same library used by LLaMA, Mistral, etc.)

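Images: Split into fixed-size patches, ViT-style. The paper does not publish the patch size; taking 14×14 pixels as an illustrative choice, a standard 224×224 image becomes: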

(224 / 14) × (224 / 14) = 16 × 16 = 256 patches

Each patch is a vector of 14×14×3 = 588 pixel values. Then:

patch_vector → Linear projection → d_model-dimensional embedding

All 256 image patches, plus the text tokens, form one unified sequence.
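
The patch arithmetic above can be checked with a small NumPy sketch. It is an illustration only: the 14×14 patch size, the tiny `D_MODEL = 64`, the random projection weights, and the 10-token text prompt are all assumptions made for the demo, not values from the paper.

```python
import numpy as np

PATCH = 14      # patch side in pixels (illustrative choice used in this section)
D_MODEL = 64    # hypothetical embedding width, kept small for the demo


def patchify(image: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * cols, patch * patch * c)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3), dtype=np.float32)   # stand-in for a real image
patches = patchify(image)                              # shape (256, 588)

# Linear projection to d_model; in a real model these weights are learned.
w_proj = (rng.standard_normal((PATCH * PATCH * 3, D_MODEL)) * 0.02).astype(np.float32)
image_tokens = patches @ w_proj                        # shape (256, D_MODEL)

# Hypothetical text embeddings for a 10-token prompt.
text_tokens = rng.standard_normal((10, D_MODEL)).astype(np.float32)
sequence = np.concatenate([text_tokens, image_tokens], axis=0)

print(patches.shape, image_tokens.shape, sequence.shape)
```

The reshape-and-transpose trick avoids an explicit loop over patches; row `r * 16 + c` of `patches` holds the pixels of the patch at grid position `(r, c)`.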

Audio: Raw audio waveforms are converted to spectrograms (a representation of frequency content over time), then treated much like images, as patches in the time-frequency domain. The paper itself describes ingesting 16 kHz audio as Universal Speech Model (USM) features.
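
To make the waveform-to-spectrogram step concrete, here is a minimal sketch using a plain short-time Fourier transform. The 25 ms window, 10 ms hop, Hann window, and the synthetic 440 Hz tone are assumptions chosen for the example; this is a generic spectrogram, not Gemini's actual audio front end.

```python
import numpy as np

SAMPLE_RATE = 16_000   # samples per second
FRAME = 400            # 25 ms analysis window (assumed)
HOP = 160              # 10 ms hop between windows (assumed)


def spectrogram(wave: np.ndarray, frame: int = FRAME, hop: int = HOP) -> np.ndarray:
    """Magnitude spectrogram with shape (time_frames, frequency_bins)."""
    n_frames = 1 + (len(wave) - frame) // hop
    window = np.hanning(frame)
    frames = np.stack([wave[i * hop: i * hop + frame] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))


# One second of a synthetic 440 Hz tone stands in for real audio.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
wave = np.sin(2 * np.pi * 440 * t)

spec = spectrogram(wave)                       # shape (98, 201)
peak_bin = int(spec.mean(axis=0).argmax())     # strongest frequency bin
peak_hz = peak_bin * SAMPLE_RATE / FRAME       # each bin spans 40 Hz here

print(spec.shape, peak_hz)
```

The resulting 2-D array can then be cut into time-frequency patches in the same way an image is cut into pixel patches.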

Video: Frames are treated as images, with temporal position encodings to show that frame 5 comes after frame 4.
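
One way to picture the temporal encoding is to add the same position vector to every patch of a given frame. The sketch below uses sinusoidal encodings, which is an assumption (the paper does not say how time is encoded); the frame count, patch count, and `d_model` are likewise placeholder values.

```python
import numpy as np


def sinusoidal(positions: np.ndarray, d_model: int) -> np.ndarray:
    """Standard sinusoidal encoding: one d_model-wide vector per position."""
    i = np.arange(d_model // 2)
    angles = positions[:, None] / (10000.0 ** (2 * i / d_model))
    enc = np.zeros((len(positions), d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc


n_frames, patches_per_frame, d_model = 8, 256, 64    # hypothetical sizes
rng = np.random.default_rng(0)

# Stand-in for patch embeddings of each frame: (frames, patches, d_model).
frame_tokens = rng.standard_normal((n_frames, patches_per_frame, d_model))

# Every patch in frame t receives the same temporal code for t.
temporal = sinusoidal(np.arange(n_frames), d_model)          # (8, 64)
video_tokens = frame_tokens + temporal[:, None, :]

# Flatten to one token sequence, frame 0 first, then frame 1, and so on.
video_sequence = video_tokens.reshape(n_frames * patches_per_frame, d_model)
print(video_sequence.shape)
```

Because frames 4 and 5 receive different codes, the model can tell them apart even when their pixel content is nearly identical.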

Result: A single sequence of tokens, all the same size, all ready to feed into a Transformer.

2. Unified Embedding Space

Once tokenised, all tokens (text, image patches, audio features, video frames) are projected to the same embedding dimension, d_model. The paper does not publish Gemini's hidden size, so a figure such as d_model = 2048 should be read as illustrative only.

This is crucial. It means:

The attention mechanism treats text tokens and image patches identically
The model can learn alignments naturally (word “cat” attends to cat pixels)
Compute follows the number of tokens each modality contributes, rather than being fixed in advance by separate text and vision towers
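
The first two points can be seen in a toy single-head self-attention over a mixed sequence. This is a minimal sketch with random, untrained weights and made-up sizes (5 text tokens, 16 image patches, `d_model = 32`); it only shows that one attention call handles both kinds of token without any modality-specific branch.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention; returns outputs and weights."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
d_model, n_text, n_image = 32, 5, 16           # hypothetical sizes

text = rng.standard_normal((n_text, d_model))     # stand-in text embeddings
image = rng.standard_normal((n_image, d_model))   # stand-in patch embeddings
x = np.concatenate([text, image])                 # one sequence, shape (21, 32)

w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)

# Fraction of the first text token's attention that lands on image patches.
text_to_image = weights[0, n_text:].sum()
print(out.shape, float(text_to_image))
```

In a trained model the text-to-image block of `weights` is where alignments such as "cat" attending to cat pixels would show up.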

3. Efficient Attention

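The paper says only that Gemini uses efficient attention mechanisms, naming multi-query attention as an example. It does not describe the exact scheme, so this section illustrates the idea with a well-documented alternative: sliding window attention combined with global attention, as in Longformer (Mistral uses the sliding window part).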

Why? Because full O(n²) attention on, say, 32,000 tokens (text + image + audio) becomes expensive. Sliding window attention:

Allows each token to attend to nearby tokens (local context)
Mixes in occasional global attention to distant tokens (preserving long-range information)
Reduces compute from O(n²) to O(n × w) where w = window size
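
The saving can be made concrete by building the attention mask and counting allowed query-key pairs. The sketch below follows the Longformer-style pattern (symmetric local window plus a few global tokens); the sequence length, window size, and choice of token 0 as the global token are assumptions for the demo, and nothing here is taken from Gemini's implementation.

```python
import numpy as np


def local_global_mask(n: int, window: int, global_idx: list[int]) -> np.ndarray:
    """Boolean (n, n) mask: True where query i may attend to key j."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    mask = np.abs(i - j) <= window // 2    # symmetric local window
    mask[global_idx, :] = True             # global tokens attend everywhere
    mask[:, global_idx] = True             # and every token attends to them
    return mask


n, w = 1024, 64                            # hypothetical sequence length and window
mask = local_global_mask(n, w, global_idx=[0])

full_pairs = n * n                         # dense attention: O(n^2)
sparse_pairs = int(mask.sum())             # windowed attention: roughly n * w
print(f"fraction of dense attention computed: {sparse_pairs / full_pairs:.3f}")
```

Doubling `n` doubles `sparse_pairs` but quadruples `full_pairs`, which is the O(n × w) versus O(n²) gap described above. A decoder such as Mistral would additionally make the window causal, so that tokens only look backwards.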

Techniques of this kind are what make a 32K-token context practical (Gemini 1.5 later extended the context to 1M tokens).

The Indian Analogy

Imagine two classrooms preparing for an inter-school debate competition:

Classroom A (Text-First Approach):

Students spend 6 months learning English — reading, writing, speaking
Teacher plays a 1-minute video about the topic
Students have to describe the video in words first, then debate using those descriptions
The video information is filtered through language

Classroom B (Unified Multimodality):

Students learn over 6 months by reading and watching videos and listening to talks and seeing diagrams — all together
When the debate topic comes up, they remember it in all modalities simultaneously
They didn’t “translate” the video into words — they learned language and vision together

Gemini is Classroom B.

Why Three Model Sizes?

The paper introduces three variants:

Gemini Nano (1.8B and 3.25B parameters, as Nano-1 and Nano-2)

Purpose: Run on-device (Pixel phones, tablets, laptops)
Speed: Low latency, since inference runs locally and needs no network round trip
Trade-off: Lower accuracy on hard tasks, but sufficient for most everyday queries
Use case: “Summarise this photo,” “Translate this sign,” on your phone without internet
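
A quick back-of-the-envelope calculation shows why these sizes suit a phone. The technical report lists Nano-1 at 1.8B and Nano-2 at 3.25B parameters and says they are 4-bit quantized for deployment; the helper below (a hypothetical name) counts weight storage only and ignores activations, caches, and runtime overhead.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


nano_sizes = {"Nano-1": 1.8e9, "Nano-2": 3.25e9}

for name, params in nano_sizes.items():
    four_bit = weight_memory_gb(params, 4)
    sixteen_bit = weight_memory_gb(params, 16)
    print(f"{name}: {four_bit:.3f} GB at 4-bit, {sixteen_bit:.3f} GB at 16-bit")
```

At 4 bits per weight, both variants need well under 2 GB for their parameters, a quarter of what 16-bit storage would take.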

Gemini Pro (size undisclosed; ~50B parameters is an unofficial estimate)

Purpose: The “sweet spot” for production
Speed: Fast enough for real-time web use, deployed on Google’s servers
Accuracy: Excellent on most benchmarks, good balance
Use case: Google Search answers, Bard conversations, Gmail Smart Compose

Gemini Ultra (size undisclosed; ~1.3T parameters is an unofficial estimate)

Purpose: Research and most demanding tasks
Speed: Slower, requires a lot of compute, used for batch processing or offline tasks
Accuracy: State-of-the-art on 30 of the 32 benchmarks reported in the paper
Use case: Generating complex documents, scientific analysis, creative writing

These are three sizes within one model family built on the same core ideas, not three unrelated designs. They differ mainly in scale and inference speed.

Training Process (Conceptually)

Collect multimodal data: Billions of examples of (text, images), (image, text, caption), (video, audio, caption), etc.
Tokenise everything: Convert all modalities to a token sequence
Train with language modelling loss: Predict the next token (whether text or image patch)
Train with masked modelling: Hide some patches or words, predict them back
Train with contrastive learning: Make the model understand that “cat” and a cat image go together

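The key difference from prior work: All modalities are trained jointly from the start, not sequentially. The paper does not spell out its training objectives, so treat the masked and contrastive steps above as plausible ingredients rather than confirmed ones.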
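
Two of the objectives listed above can be written down in a few lines. This is a generic sketch of next-token cross-entropy and a symmetric InfoNCE contrastive loss, using random stand-in logits and embeddings; the function names, the temperature of 0.07, and all sizes are assumptions, and nothing here is claimed to be Gemini's actual training recipe.

```python
import numpy as np


def log_softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def next_token_loss(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean cross-entropy; row t of logits scores the token at position t + 1."""
    logp = log_softmax(logits)
    return float(-logp[np.arange(len(targets)), targets].mean())


def contrastive_loss(text_emb, image_emb, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE: row i of each input is a matching (text, image) pair."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sim = t @ v.T / temperature
    idx = np.arange(len(sim))
    text_to_image = -log_softmax(sim, axis=1)[idx, idx].mean()
    image_to_text = -log_softmax(sim, axis=0)[idx, idx].mean()
    return float((text_to_image + image_to_text) / 2)


rng = np.random.default_rng(0)

# Next-token prediction on a random 13-token sequence over a 100-token vocabulary.
vocab, seq_len = 100, 12
tokens = rng.integers(0, vocab, size=seq_len + 1)
logits = rng.standard_normal((seq_len, vocab))        # stand-in model outputs
lm = next_token_loss(logits, tokens[1:])

# Contrastive loss is low for aligned pairs and high when pairs are mismatched.
emb = rng.standard_normal((8, 16))
aligned = contrastive_loss(emb, emb + 0.01 * rng.standard_normal((8, 16)))
shuffled = contrastive_loss(emb, emb[::-1])
print(lm, aligned, shuffled)
```

In joint training, losses like these would be computed on batches that mix modalities, so a single gradient step updates the shared weights for all of them.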

A Critical Detail: Emergent Multimodal Capabilities

A notable claim for natively multimodal models: they can combine modalities in ways they were not explicitly trained on.

For example:

Trained on (text, image) and (text, audio) pairs? The model may handle (text, image, audio) triplets without specific fine-tuning
Trained on image captioning? The model can do image-to-image reasoning

This “emergence” is a frequently reported, though still debated, effect of scale in LLMs, and Gemini is expected to benefit from the same principle.

Why This Approach Works

Shared vocabulary: All modalities speak the same “language” (tokens)
Shared attention: The same learned attention parameters apply to every modality
Joint optimisation: The loss function optimises for understanding across modalities simultaneously
Scale efficiency: Using one model (not separate text + vision + audio models) is more parameter-efficient

Next: The Math: How Tokenisation and Embeddings Work