Section 09

Summary: Gemini in One Sentence

Gemini: A Family of Highly Capable Multimodal Models (2023)

Gemini is Google’s natively multimodal model family: it maps text, images, audio, and video into one shared token sequence, feeds that sequence through a single Transformer with efficient attention, and reports 90.04% on MMLU with Gemini Ultra, which supports the case that native multimodality can outperform bolted-on vision adapters.


Core Idea Recap

Aspect | Details
Problem | Text-first models bolt on vision inefficiently; true multimodal reasoning requires joint training
Solution | Train one model from scratch on text + images + audio + video simultaneously
Architecture | Unified tokenisation → shared embedding space → Transformer with efficient attention
Key Achievement | Gemini Ultra: 90.04% on MMLU (first model reported to exceed the human-expert baseline of 89.8%)
Sizes | Nano (on-device), Pro (balanced), Ultra (most capable)
Impact | Restored Google’s credibility in AI, sparked competition, led to Gemini 1.5 (1M-token context), and pushed the industry toward multimodal-first design

Indian Analogy Recap

Gemini is like a student who learned language, visual reasoning, and audio understanding together from the start — not a student who memorized English vocabulary first, then tried to understand maps later. This simultaneous learning creates fluency that bolted-on approaches can’t match.


The Math, Simply

  1. Images → split into patches of 14×14 pixels (a 224×224 image gives 16×16 = 256 patches)
  2. All modalities → embedded to the same dimension (e.g., d_model = 2048 as an illustrative value; the Gemini report does not publish model dimensions)
  3. All tokens → positional encoding added
  4. Attention → each token attends to the others (image patches attend to text, text to image patches)
  5. Output → next token predicted (text, or a discrete image token)
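
The five steps above can be sketched in a few lines of NumPy. This is a toy illustration, not Gemini's actual implementation: the width `D_MODEL = 64`, the vocabulary size `VOCAB`, the random weights, and the helper names `patchify` and `self_attention` are all assumptions made for the example. It uses a single attention head with no causal mask, matching the simplified "every token attends to every other" description.

```python
import numpy as np

rng = np.random.default_rng(0)

PATCH = 14      # patch side in pixels, as in step 1
D_MODEL = 64    # small illustrative width (assumption, not Gemini's value)
VOCAB = 1000    # hypothetical text vocabulary size


def patchify(image, patch=PATCH):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over the mixed sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Step 1: image -> 16 x 16 = 256 patches of 14 x 14 x 3 values each
image = rng.random((224, 224, 3))
patches = patchify(image)

# Step 2: project both modalities to the same dimension
w_patch = rng.normal(0, 0.02, (patches.shape[1], D_MODEL))
text_table = rng.normal(0, 0.02, (VOCAB, D_MODEL))
text_ids = np.array([5, 42, 7, 300])  # made-up token ids
image_tokens = patches @ w_patch
text_tokens = text_table[text_ids]

# Step 3: one sequence, plus a placeholder positional encoding
sequence = np.concatenate([image_tokens, text_tokens], axis=0)
sequence = sequence + rng.normal(0, 0.02, sequence.shape)

# Step 4: every token attends to every other token, across modalities
w_q, w_k, w_v = (rng.normal(0, 0.02, (D_MODEL, D_MODEL)) for _ in range(3))
output, weights = self_attention(sequence, w_q, w_k, w_v)

# Step 5: score the vocabulary from the last position (tied embeddings)
logits = output[-1] @ text_table.T
next_id = int(np.argmax(logits))

print(patches.shape, sequence.shape, weights.shape)
```

The attention matrix has one row and one column per token regardless of modality, which is the sense in which image patches and text share one model. A real decoder would add a causal mask, many heads and layers, and trained weights.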

What Changed in AI

Before Gemini (2023):

  • OpenAI had momentum with GPT-4
  • Multimodal was “bolt-on vision to language models”
  • Context lengths were limited (4K–128K tokens)

After Gemini (2024+):

  • Google proved it could compete
  • Multimodal-from-scratch training became a common design goal for frontier models
  • Context lengths exploded (Gemini 1.5: 1M tokens)
  • Open-weights models built on Gemini research (Gemma) became accessible

Three Key Numbers

  1. 90.04%: Gemini Ultra's MMLU score, exceeding the human-expert baseline (89.8%)
  2. 32K → 1M tokens: the context leap from Gemini 1.0 to 1.5
  3. $0.0005 per 1K tokens: the price point quoted here, about 60x cheaper than GPT-4 at $0.03 per 1K tokens
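
The arithmetic behind these three numbers is easy to check. The short sketch below uses only figures quoted in this section (the GPT-4 price of 0.03 per 1K tokens appears in the discussion questions); the variable names are made up for the example, and the 32K context is treated as 32,000 tokens for simplicity.

```python
# Figures as quoted in this section
mmlu_gemini_ultra = 90.04   # percent
mmlu_human_expert = 89.8    # percent
context_v1_tokens = 32_000
context_v15_tokens = 1_000_000
price_gemini_per_1k = 0.0005
price_gpt4_per_1k = 0.03

# Margin over the human-expert baseline, in percentage points
margin_points = round(mmlu_gemini_ultra - mmlu_human_expert, 2)

# How many times larger the 1.5 context window is
context_ratio = context_v15_tokens / context_v1_tokens

# How many times cheaper the quoted Gemini price is
price_ratio = price_gpt4_per_1k / price_gemini_per_1k

print(f"MMLU margin: {margin_points:.2f} points")
print(f"Context growth: {context_ratio:.2f}x")
print(f"Price ratio: {price_ratio:.0f}x")
```

With the prices quoted here, the ratio comes out at roughly 60x, and the MMLU margin over the human-expert baseline is about a quarter of a percentage point.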

If You Remember Nothing Else

  1. Gemini processes text, images, audio, video as one unified language (not separate streams)
  2. This works because all modalities are converted to tokens in one shared embedding space and fed to the same Transformer
  3. The result: multimodal reasoning that’s more efficient and capable than “vision encoder + language model” approaches

What Came Next

  • Gemini 1.5 (February 2024): 1M-token context; better understanding of long documents and code
  • Gemma (February 2024): Open-weights models built from Gemini research; initially released at 2B and 7B parameters
  • Claude 3 and GPT-4V iterations: Industry-wide push for better multimodal models
  • Mamba (preprint, December 2023): Linear-time alternative to Transformers (next paper)

How to Deepen Your Understanding

  1. Read the related papers: Paper 18 (Mistral) for efficient attention ideas, Paper 19 (Ring Attention) for long-context techniques
  2. Understand vision: If multimodality interests you, read about Vision Transformers (ViT)
  3. Follow up: Read the Gemini 1.5 and Gemma technical reports (both publicly available)


Discussion Questions

  1. Why did Google choose “native multimodality” over the faster approach of bolting vision onto an existing language model?

    • Because native multimodality leads to better understanding of cross-modal relationships (the word “cat” and a cat image align naturally in the embedding space)
  2. If Gemini 1.0 has a 32K-token context and GPT-4 Turbo has 128K, why is Gemini considered better?

    • Gemini 1.5 (released later) has a 1M-token context. But even early Gemini competed on quality (MMLU score) rather than context length. Different trade-offs suit different tasks.
  3. Why is the price ($0.0005 per 1K tokens) so much lower than GPT-4 ($0.03 per 1K tokens)?

    • Google has massive compute infrastructure (TPUs). Also strategic pricing to gain market share. Prices typically drop as models become more efficient.
  4. What’s the difference between “native multimodality” and “bolted-on vision”?

    • Native: All modalities trained together from day one; the model learns cross-modal alignments naturally (e.g., “cat” embedding aligns with cat-image embedding)
    • Bolted-on: Text model trained first, then a vision encoder added and fine-tuned; the two parts are aligned only after the fact, which can limit cross-modal reasoning
  5. Why does Gemini’s 1M-token context (in 1.5) matter?

    • You can now summarise a full novel, analyse a large codebase, or process many hours of meeting transcripts in one prompt. Earlier production models topped out at far shorter contexts (for example, 128K to 200K tokens).
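
To make the context-length point concrete, here is a rough sketch of checking whether a document fits in a given window. The ratio of 1.3 tokens per English word is a common rule of thumb and an assumption here, not a property of Gemini's tokenizer; the helper names `estimate_tokens` and `fits`, the 100,000-word novel, and the 2,000-token output reserve are likewise made up for illustration.

```python
TOKENS_PER_WORD = 1.3  # rough rule of thumb for English (assumption)


def estimate_tokens(word_count, tokens_per_word=TOKENS_PER_WORD):
    """Estimate the token count of a text from its word count."""
    return int(round(word_count * tokens_per_word))


def fits(word_count, context_tokens, reserve_for_output=2_000):
    """Check whether the text plus room for a reply fits in the window."""
    return estimate_tokens(word_count) + reserve_for_output <= context_tokens


novel_words = 100_000  # hypothetical novel length

windows = [("32K", 32_000), ("128K", 128_000), ("1M", 1_000_000)]
for name, window in windows:
    verdict = "fits" if fits(novel_words, window) else "does not fit"
    print(f"{name} window: {verdict}")
```

Under these assumptions the novel needs about 130,000 tokens, so it overflows both the 32K and 128K windows and only the 1M window can take it in a single prompt. For real use, count tokens with the model's own tokenizer rather than a word-based estimate.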

Final thought: Gemini’s story is the story of frontier AI in 2023–2024: rapid iteration, competitive pressure driving innovation, and the shift from “one company dominates” (OpenAI) to “multiple capable models exist” (Gemini, Claude, LLaMA). This competition is good for everyone — prices fall, quality rises, and capabilities expand. The best time to learn AI is now.
