The Idea: Native Multimodality from the Start
Core Insight
Instead of this:
Text Model (trained for years) + Vision Encoder (bolted on) + Projection = Multimodal
Gemini does this:
Single unified model trained jointly on text + images + audio + video from day one
The key idea in the paper: the models are trained jointly across image, audio, video, and text data. Not retrofitted. Not adapted. Built from scratch to think in multiple modalities.
The Architecture (High Level)
┌─────────────────────────────────────────┐
│ Input: Text | Image | Audio | Video │
└────────┬────────────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ Unified Tokeniser │
│ (SentencePiece for text) │
│ (Patch embeddings for images) │
│ (Spectral features for audio) │
│ (Frame + temporal for video) │
└────────┬────────────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ Embedding & Projection to common space │
│ All modalities → same d_model dimension │
└────────┬────────────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ Transformer Stack │
│ (Efficient attention, shared weights) │
└────────┬────────────────────────────────┘
│
↓
┌─────────────────────────────────────────┐
│ Task-Specific Output Heads │
│ (Language, image generation, etc.) │
└─────────────────────────────────────────┘
The Three Key Components
1. Unified Tokenisation
Text: Already tokenised using SentencePiece (same as LLaMA, Mistral, etc.)
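Images: Split into fixed-size patches, ViT-style. The paper does not state Gemini's patch size, so take 14×14 pixels as an illustrative choice. A standard 224×224 image then becomes: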
- (224 / 14) × (224 / 14) = 16 × 16 = 256 patches
Each patch is a vector of 14×14×3 = 588 pixel values. Then:
patch_vector → Linear projection → d_model-dimensional embedding
All 256 image patches, plus the text tokens, form one unified sequence.
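The patch arithmetic above can be checked with a small sketch. It uses plain nested lists rather than a tensor library, and the function name `patchify` and the 14-pixel patch size are illustrative assumptions, not details from the paper.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    flat.extend(pixel)  # append this pixel's channel values
            patches.append(flat)
    return patches


# A dummy 224 x 224 RGB image filled with zeros.
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
patches = patchify(image, 14)
print(len(patches), len(patches[0]))  # 256 588
```

Each of the 256 flattened patches would then go through a learned linear layer to reach d_model dimensions.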
Audio: The paper says Gemini ingests audio at 16 kHz as Universal Speech Model (USM) features. Conceptually, the waveform is converted to a spectrogram (a representation of frequency over time) and then treated much like an image, as patches in the time-frequency domain.
Video: Frames are treated as images, with temporal position encodings to show that frame 5 comes after frame 4.
Result: A single sequence of tokens, all the same size, all ready to feed into a Transformer.
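As a rough sketch of what "a single sequence" means in practice, the snippet below lays text tokens, image patches, and video frames end to end and tags each with its modality. The `Token` record, the `build_sequence` helper, and the ordering of modalities are all hypothetical; the paper does not specify how Gemini arranges its inputs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    modality: str                 # "text", "image" or "video"
    index: int                    # position within its own segment
    frame: Optional[int] = None   # frame number, set for video tokens only


def build_sequence(n_text, n_image_patches, n_video_frames, patches_per_frame) -> List[Token]:
    """Concatenate all modalities into one flat token sequence."""
    seq = [Token("text", i) for i in range(n_text)]
    seq += [Token("image", i) for i in range(n_image_patches)]
    for f in range(n_video_frames):
        # The frame number is what lets the model know frame 5 follows frame 4.
        seq += [Token("video", i, frame=f) for i in range(patches_per_frame)]
    return seq


sequence = build_sequence(n_text=12, n_image_patches=256,
                          n_video_frames=4, patches_per_frame=256)
print(len(sequence))  # 12 + 256 + 4 * 256 = 1292
```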
2. Unified Embedding Space
Once tokenised, all tokens (text, image patches, audio features, video frames) are projected to the same embedding dimension, d_model. Google has not published this value for any Gemini model, so a figure such as d_model = 2048 should be read as illustrative only.
This is crucial. It means:
- The attention mechanism treats text tokens and image patches identically
- The model can learn alignments naturally (word “cat” attends to cat pixels)
- Compute is shared: every Transformer layer processes all modalities, rather than a large text model doing most of the work and a small vision adapter the rest
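A minimal sketch of projecting different modalities into one space follows. The weights are random stand-ins for learned layers, and the toy width of 8, the 80 audio features per frame, and the 1,000-entry vocabulary are assumptions chosen only to keep the example small.

```python
import random

D_MODEL = 8  # toy width for illustration, far smaller than any real model


def make_projection(in_dim, out_dim, rng):
    """Random in_dim x out_dim matrix standing in for a learned layer."""
    return [[rng.gauss(0.0, 0.02) for _ in range(out_dim)] for _ in range(in_dim)]


def project(vector, weights):
    """Compute vector @ weights for a single input vector."""
    out_dim = len(weights[0])
    return [sum(v * row[j] for v, row in zip(vector, weights)) for j in range(out_dim)]


rng = random.Random(0)
patch_proj = make_projection(588, D_MODEL, rng)    # 14 * 14 * 3 values per image patch
audio_proj = make_projection(80, D_MODEL, rng)     # 80 features per audio frame (assumed)
text_table = make_projection(1000, D_MODEL, rng)   # toy vocabulary of 1000 token ids

text_emb = text_table[42]                          # embedding lookup for token id 42
patch_emb = project([0.5] * 588, patch_proj)
audio_emb = project([0.1] * 80, audio_proj)

sequence = [text_emb, patch_emb, audio_emb]
print([len(e) for e in sequence])  # [8, 8, 8]
```

Because every vector ends up with the same length, the attention layers that follow need no modality-specific code paths.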
3. Efficient Attention
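The paper says only that Gemini uses efficient attention mechanisms, naming multi-query attention as an example. It does not describe sliding window attention; the local-plus-global scheme below, from earlier work such as Longformer and Mistral, is one well-known way to cut attention cost and is used here as an illustration.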
Why? Because full O(n²) attention on, say, 32,000 tokens (text + image + audio) becomes expensive. Sliding window attention:
- Allows each token to attend to nearby tokens (local context)
- Mixes in occasional global attention to distant tokens (preserving long-range information)
- Reduces compute from O(n²) to O(n × w) where w = window size
Efficiency techniques of this kind are what make a 32K-token context practical (Gemini 1.5 later extended the context to 1M tokens).
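The local-plus-global idea can be sketched as a masking rule. This is an illustration of the general technique, not Gemini's implementation; the function names, the window of 32, and the rule that every 128th token is global are assumptions.

```python
def can_attend(query, key, window, global_stride):
    """Causal mask: a query sees recent keys plus a sparse set of global keys."""
    if key > query:
        return False                        # never attend to future tokens
    is_local = query - key < window         # key lies inside the sliding window
    is_global = key % global_stride == 0    # every global_stride-th token is global
    return is_local or is_global


def count_pairs(n, window, global_stride):
    """Count the (query, key) pairs the mask allows, by brute force."""
    return sum(can_attend(q, k, window, global_stride)
               for q in range(n) for k in range(n))


n = 512
sparse = count_pairs(n, window=32, global_stride=128)
dense = n * (n + 1) // 2   # pairs under full causal attention
print(sparse, dense)
```

The sparse count grows roughly as n × w, while the dense count grows as n², which is the saving described above.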
The Indian Analogy
Imagine two classrooms preparing for an inter-school debate competition:
Classroom A (Text-First Approach):
- Students spend 6 months learning English — reading, writing, speaking
- Teacher plays a 1-minute video about the topic
- Students have to describe the video in words first, then debate using those descriptions
- The video information is filtered through language
Classroom B (Unified Multimodality):
- Students learn over 6 months by reading and watching videos and listening to talks and seeing diagrams — all together
- When the debate topic comes up, they remember it in all modalities simultaneously
- They didn’t “translate” the video into words — they learned language and vision together
Gemini is Classroom B.
Why Three Model Sizes?
The paper introduces three variants:
Gemini Nano (1.8B parameters for Nano-1, 3.25B for Nano-2)
- Purpose: Run on-device (Pixel phones, tablets, laptops)
- Speed: Inference in milliseconds, no network needed
- Trade-off: Lower accuracy on hard tasks, but sufficient for most everyday queries
- Use case: “Summarise this photo,” “Translate this sign,” on your phone without internet
Gemini Pro (~50B parameters; unofficial estimate, size not disclosed)
- Purpose: The “sweet spot” for production
- Speed: Fast enough for real-time web use, deployed on Google’s servers
- Accuracy: Excellent on most benchmarks, good balance
- Use case: Bard conversations and other server-side Google products
Gemini Ultra (~1.3T parameters; unofficial estimate, size not disclosed)
- Purpose: Research and most demanding tasks
- Speed: Slower, requires a lot of compute, used for batch processing or offline tasks
- Accuracy: State-of-the-art on nearly all benchmarks
- Use case: Generating complex documents, scientific analysis, creative writing
These are three sizes of the same architecture family, not three unrelated designs. Each size is trained as its own model, but the core ideas are the same; what differs is scale, accuracy, and inference cost.
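A back-of-the-envelope calculation shows why size decides where a model can run. The sketch below counts only the memory needed to hold the weights. The 3.25B figure is the Nano-2 size from the paper, the 50B and 1.3T figures are this article's unofficial estimates, and the precisions (4-bit on device, 16-bit on servers) are assumptions for illustration.

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Memory needed just to hold the weights, in gigabytes (10**9 bytes)."""
    return n_params * bytes_per_param / 1e9


variants = [
    ("Nano-2", 3.25e9, 0.5),   # 4-bit weights: half a byte per parameter
    ("Pro",    50e9,   2.0),   # 16-bit weights; parameter count is an estimate
    ("Ultra",  1.3e12, 2.0),   # 16-bit weights; parameter count is an estimate
]

for name, n_params, bytes_per_param in variants:
    print(f"{name}: {weight_memory_gb(n_params, bytes_per_param):.1f} GB")
```

Under these assumptions the smallest model fits in a phone's memory, while the largest needs its weights spread across many accelerators.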
Training Process (Conceptually)
- Collect multimodal data: Billions of examples such as (image, caption) pairs, text interleaved with images, and (video, audio, caption) triples
- Tokenise everything: Convert all modalities to a token sequence
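- Train with language modelling loss: Predict the next token (whether text or image patch)
- Train with masked modelling: Hide some patches or words, predict them back (a common multimodal objective; the paper does not confirm it for Gemini)
- Train with contrastive learning: Pull the embedding of “cat” towards embeddings of cat images and away from unrelated ones (again common practice, not confirmed for Gemini)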
The key difference from prior work: All modalities are trained jointly from the start, not sequentially.
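Two of the objectives listed above can be written down in a few lines. This is a conceptual sketch, not the paper's training recipe: it assumes every token, including image content, has a discrete id for the next-token loss, and it uses a one-directional InfoNCE form for the contrastive loss. The function names and the temperature of 0.1 are illustrative.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]


def next_token_loss(logits_per_position, token_ids):
    """Average loss for predicting token t+1 from the logits at position t."""
    losses = [cross_entropy(logits_per_position[t], token_ids[t + 1])
              for t in range(len(token_ids) - 1)]
    return sum(losses) / len(losses)


def contrastive_loss(text_embs, image_embs, temperature=0.1):
    """One-directional InfoNCE: text i should score highest with image i."""
    losses = []
    for i, text in enumerate(text_embs):
        sims = [sum(a * b for a, b in zip(text, image)) / temperature
                for image in image_embs]
        losses.append(cross_entropy(sims, i))
    return sum(losses) / len(losses)


# Uniform logits over a 4-token vocabulary give a loss of ln(4).
uniform = next_token_loss([[0.0] * 4] * 3, [1, 2, 3])

# Matched text/image pairs score a lower loss than swapped pairs.
matched = contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(uniform, matched, swapped)
```

In joint training, losses like these would be computed over batches that mix modalities, so one set of weights is updated by all of them.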
A Critical Detail: Emergent Multimodal Capabilities
The paper highlights something remarkable: the model can combine modalities in ways it was not explicitly fine-tuned for.
For example:
- Trained on (text, image) pairs? The model can suddenly handle (text, image, audio) triplets without specific fine-tuning
- Trained on image captioning? The model can do image-to-image reasoning
This “emergence” is a known property of scale in LLMs, and Gemini benefits from the same principle.
Why This Approach Works
- Shared vocabulary: All modalities speak the same “language” (tokens)
- Shared attention: The model’s intelligence (attention weights) applies to all modalities equally
- Joint optimisation: The loss function optimises for understanding across modalities simultaneously
- Scale efficiency: One model shares its parameters across modalities, which can be more parameter-efficient than maintaining separate text, vision, and audio models