Section 01

Context: Why Did Google Build Gemini?

Gemini: A Family of Highly Capable Multimodal Models 2023

The GPT-4 Shock

In March 2023, OpenAI released GPT-4. It wasn’t just incrementally better than GPT-3.5: it accepted image inputs and was markedly stronger at understanding complex documents, solving math problems, and writing code. The AI world collectively gasped. Google, which had invented the Transformer and released BERT, suddenly found itself playing catch-up.

At the same time, Google’s own efforts were fragmented: Bard was a text-only chatbot, and internally, separate teams were building language models and vision models. Meanwhile, rival labs had:

  • OpenAI: GPT-4 (multimodal, capable, widely accessible)
  • Meta: LLaMA (openly available weights, lightweight)
  • Anthropic: Claude (safety-focused, beginning to scale)

For Google, this was a strategic crisis. The company that had the compute, the data, and the expertise was being outpaced in public perception.

The Multimodal Vision

Historically, computer vision and NLP were separate fiefdoms:

  • Vision-only teams built image classifiers and object detectors (YOLO, R-CNN)
  • NLP teams built language models (BERT, GPT)
  • Multimodal attempts paired separate components: CLIP (an image encoder and a text encoder aligned for text → image search), BLIP (vision encoder + language decoder)

The problem: bolted-on multimodality is inefficient. You train a vision encoder (ResNet, ViT) to understand images, then you train a language model on text, then you connect them with a small projection layer. The two parts don’t truly “understand” each other — they’re translating.
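The bolted-on pattern described above can be sketched in a few lines. This is a minimal illustration, not the architecture of any particular model: the array sizes, the random stand-in features, and the helper name project_image_features are all assumptions made for the example. The only trainable bridge between the two pretrained components is a single linear projection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
NUM_PATCHES, VISION_DIM = 16, 64   # output of a separately trained vision encoder
NUM_TOKENS, TEXT_DIM = 8, 128      # embedding width of a separately trained language model


def project_image_features(patch_features, weight, bias):
    """Linearly map vision-encoder features into the language model's embedding space."""
    return patch_features @ weight + bias


# Random stand-ins for the outputs of the two pretrained components.
patch_features = rng.normal(size=(NUM_PATCHES, VISION_DIM))
text_embeddings = rng.normal(size=(NUM_TOKENS, TEXT_DIM))

# The projection layer is the only new piece connecting the two models.
weight = rng.normal(scale=0.02, size=(VISION_DIM, TEXT_DIM))
bias = np.zeros(TEXT_DIM)

image_tokens = project_image_features(patch_features, weight, bias)

# The language model then reads projected image tokens followed by text tokens.
lm_input = np.concatenate([image_tokens, text_embeddings], axis=0)
print(lm_input.shape)  # (24, 128)
```

Everything the language model learns about the image has to pass through that one matrix, which is why the text calls the two halves "translating" rather than sharing a representation.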

By 2023, the question was obvious: What if you trained a single model from the start that understood text, images, audio, and video as one unified language?

This is what Gemini attempted.

Key Players and Timeline

Google DeepMind (newly merged from Google Brain + DeepMind) assembled a massive team:

  • Demis Hassabis (CEO, Google DeepMind): guiding long-term vision
  • Sundar Pichai (CEO, Google): product direction
  • Koray Kavukcuoglu (VP, DeepMind): research leadership
  • Dozens of engineers from both organizations

The paper’s byline is “Gemini Team, Google” rather than a list of individual authors, which is unusual for a research paper; contributors are credited in a separate section at the end. This signals enormous organizational effort.

Timeline

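  • April 2023: Google announces the merger of Google Brain and DeepMind into Google DeepMind. Teams that had worked separately on LLMs and vision models begin coordinating.
  • May 2023: Google previews Gemini at Google I/O, describing the model as still in training.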
  • July 2023: Google’s Product Summit hints at major announcements.
  • December 6, 2023: Gemini announced. Technical report published.

The Benchmark Context

When Gemini was announced, the big question was: Can Google’s model beat GPT-4?

The metric everyone watched: MMLU (Massive Multitask Language Understanding) — a benchmark of 57 diverse academic subjects (history, law, science, medicine, etc.) with 14,042 multiple-choice questions.

  • GPT-4: 86.4%
  • Human expert baseline: 89.8% (humans with subject-matter expertise)
  • Gemini Ultra claim: 90.04% — first model to exceed the human baseline

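This was significant PR, though later scrutiny pointed out that the comparison was not like-for-like. The 90.04% figure used chain-of-thought prompting with 32 samples (CoT@32), while GPT-4’s 86.4% was a 5-shot result; under the same 5-shot protocol, the technical report lists Gemini Ultra at 83.7%. Data contamination was also raised as a concern. But at the time, it was a watershed moment: a non-OpenAI model had reported scores above both GPT-4 and the human expert baseline.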
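To get a feel for how narrow these margins are, the percentages can be converted into approximate question counts. The sketch below assumes each headline score is a plain fraction of the 14,042 test questions answered correctly; reported MMLU numbers are sometimes averaged differently, so treat the counts as rough. The helper name approx_correct is made up for this example.

```python
MMLU_TEST_QUESTIONS = 14_042

# Headline accuracies quoted above.
reported_accuracy = {
    "GPT-4": 0.864,
    "Human expert baseline": 0.898,
    "Gemini Ultra": 0.9004,
}


def approx_correct(accuracy, total=MMLU_TEST_QUESTIONS):
    """Approximate number of correct answers implied by an accuracy figure."""
    return round(accuracy * total)


correct = {name: approx_correct(acc) for name, acc in reported_accuracy.items()}

margin_over_humans = correct["Gemini Ultra"] - correct["Human expert baseline"]
margin_over_gpt4 = correct["Gemini Ultra"] - correct["GPT-4"]

for name, count in correct.items():
    print(f"{name}: ~{count} of {MMLU_TEST_QUESTIONS}")
print(f"Gemini Ultra vs human baseline: ~{margin_over_humans} questions")
print(f"Gemini Ultra vs GPT-4: ~{margin_over_gpt4} questions")
```

Under that assumption, the gap between Gemini Ultra and the human expert baseline is roughly 33 questions out of 14,042, which is small enough that differences in prompting or evaluation protocol could plausibly account for it.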

Why Multimodality Matters

Before Gemini, most frontier LLMs were text-only. To use vision, you had to do one of the following:

  1. Convert the image to text (describe it in words)
  2. Use a separate vision API
  3. Use a model like GPT-4V (whose architecture is undisclosed, but which is often assumed to be a text-first model with vision added later)

This is inefficient for tasks like:

  • Reading a handwritten exam paper (text + layout + handwriting style)
  • Understanding a scientific diagram (text labels + visual relationships)
  • Analyzing charts and graphs (visual representation + data)
  • Processing videos (visual + temporal + audio)

A natively multimodal model that sees these modalities together from the start can, in principle:

  • Learn cross-modal alignments naturally (a word and an image feature should activate together)
  • Allocate compute efficiently (don’t oversample text, undersample images)
  • Reason about relationships (how does this visual element relate to this description?)

Setting the Stage

By December 2023, the landscape was:

  • OpenAI leading in public perception and capability
  • Google responding with massive internal coordination
  • Meta pushing openly released models (LLaMA)
  • Anthropic focused on safety and reliability
  • The race for multimodality just beginning

Gemini was Google’s bet: that unified, natively multimodal training would prove superior to the text-first approach.

