Paper 20

Further Reading — Gemini: A Family of Highly Capable Multimodal Models

Original Paper & Reports

  1. Gemini: A Family of Highly Capable Multimodal Models (2023)
    https://arxiv.org/abs/2312.11805
    Official technical report. Read this after you understand the basics — it’s dense but contains all official claims.

  2. Gemini 1.5: Unlocking Multimodal Understanding at Scale (2024)
    https://arxiv.org/abs/2403.05530
    The follow-up: 1M token context, improved performance. Shows how quickly the field iterated.

  3. Gemma: Open Models Based on Gemini Research and Technology (2024)
    https://arxiv.org/abs/2403.08295
    Google’s open-weights models built from Gemini research and technology. Released in 2B and 7B variants. Good for seeing how Gemini-derived techniques apply at small scale.


Foundational Papers (Understand These First)

  1. Attention Is All You Need (Vaswani et al., 2017)
    https://arxiv.org/abs/1706.03762
    The original Transformer. Essential prerequisite. Most modern models (Gemini, Claude) build on this; Mamba, covered later in this series, is a deliberate departure from it.

  2. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale (Dosovitskiy et al., 2021)
    https://arxiv.org/abs/2010.11929
    Vision Transformer (ViT). Explains how images can be tokenized into patches, which Gemini uses.

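  3. Language Models are Unsupervised Multitask Learners (Radford et al., 2019)
    https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
    The GPT-2 paper, predecessor to GPT-3 (published by OpenAI rather than on arXiv). Shows how next-token language modeling scales; Gemini builds on the same decoder-style approach.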
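
The patch tokenization mentioned in the ViT entry above can be sketched in a few lines. This is a minimal pure-Python illustration, not the ViT or Gemini implementation: the function name `patchify` and the toy image are hypothetical, and a real model would also apply a learned linear projection and add position embeddings to each patch.

```python
from typing import List

Image = List[List[float]]  # a single-channel H x W image as nested lists


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W image into non-overlapping patch x patch tiles.

    Each tile is flattened row by row into one vector (one "token").
    Tiles are emitted in raster order: left to right, then top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens


# Toy 4x4 image with pixel values 0..15, split into 2x2 patches.
toy = [[float(4 * r + c) for c in range(4)] for r in range(4)]
tokens = patchify(toy, patch=2)
print(len(tokens), tokens[0])
```

At the setting in the paper's title, a 224×224 image cut into 16×16 patches yields (224 / 16)² = 196 tokens per image.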


Efficient Attention & Long Context (Why Gemini Needed These)

  1. Longformer: The Long-Document Transformer (Beltagy et al., 2020)
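    https://arxiv.org/abs/2004.05150
    Introduces sliding window + global attention for longer sequences. The Gemini report mentions efficient attention mechanisms (for example, multi-query attention) for its 32K context, but does not say it uses Longformer-style sparsity.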

  2. Efficient Transformers: A Survey (Tay et al., 2022)
    https://arxiv.org/abs/2009.06732
    Comprehensive survey of O(n log n) and O(n) attention variants. Understand what “efficient attention” means.

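  3. Ring Attention with Blockwise Transformers for Near-Infinite Context (Liu et al., 2024)
    https://arxiv.org/abs/2310.01889
    Splits attention over a long sequence across devices arranged in a ring. Useful background for how long-context models can be spread across accelerator clusters; the Gemini report does not state that it uses this method.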
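
The sliding window + global attention pattern from the Longformer entry above can be illustrated as a boolean mask. This is a sketch under stated assumptions: the function name `longformer_style_mask` is hypothetical, `window` here means the one-sided radius (the Longformer paper describes a total window size), and nothing in it reflects Gemini's undisclosed attention design.

```python
from typing import List, Sequence


def longformer_style_mask(
    n: int, window: int, global_idx: Sequence[int] = ()
) -> List[List[bool]]:
    """Return an n x n mask where mask[i][j] is True if token i may attend to j.

    Local:  each token sees neighbours within `window` positions on either side.
    Global: tokens in `global_idx` see every token and are seen by every token.
    """
    mask = [[abs(i - j) <= window for j in range(n)] for i in range(n)]
    for g in set(global_idx):
        for k in range(n):
            mask[g][k] = True  # the global token attends to all positions
            mask[k][g] = True  # all positions attend to the global token
    return mask


mask = longformer_style_mask(n=8, window=1, global_idx=[0])
allowed = sum(sum(row) for row in mask)  # number of permitted (i, j) pairs
print(allowed, "of", 8 * 8)
```

With a fixed window and a fixed number of global tokens, the count of permitted pairs grows linearly in the sequence length rather than quadratically, which is the efficiency argument the survey in the same list develops.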


Multimodal & Vision-Language Models (Competition & Evolution)

  1. Flamingo: a Visual Language Model for Few-Shot Learning (Alayrac et al., 2022)
    https://arxiv.org/abs/2204.14198
    DeepMind’s earlier multimodal model. Shows the “bolt-on vision” approach that Gemini improved upon.

  2. CLIP: Learning Transferable Visual Models From Natural Language Supervision (Radford et al., 2021)
    https://arxiv.org/abs/2103.00020
    Text-image alignment model. Influenced how multimodal models learn shared representations.

  3. GPT-4 Technical Report (OpenAI, 2023)
    https://arxiv.org/abs/2303.08774
    OpenAI’s report on GPT-4, which accepts image and text inputs; the vision capability was later released as GPT-4V. Competing design to Gemini’s native multimodality.

  4. LLaVA: Visual Instruction Tuning (Liu et al., 2023)
    https://arxiv.org/abs/2304.08485
    Open-source vision-language model. Shows how to build on open foundations (LLaMA + vision encoder).

  5. Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks (Lu et al., 2022)
    https://arxiv.org/abs/2206.08916
    Early attempt at a truly unified multimodal model. Relevant for understanding Gemini’s vision.


Benchmarks & Evaluation (Understanding the Numbers)

  1. MMLU: Measuring Massive Multitask Language Understanding (Hendrycks et al., 2020)
    https://arxiv.org/abs/2009.03300
    The benchmark on which Gemini Ultra reports 90.04% (using chain-of-thought prompting with 32 samples). Understand what the 57 subjects are and why this benchmark matters.

  2. Evaluating Large Language Models Trained on Code (Chen et al., 2021)
    https://arxiv.org/abs/2107.03374
    Introduces the HumanEval benchmark (code generation) and the pass@k metric. Gemini Ultra reports 74.4% on this.

  3. GSM8K: Training Verifiers to Solve Math Word Problems (Cobbe et al., 2021)
    https://arxiv.org/abs/2110.14168
    Grade-school math benchmark. Gemini Ultra reports 94.4% using majority voting over 32 samples. The "human expert" comparison in the Gemini report applies to MMLU, not to GSM8K.
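
Code-generation scores such as the HumanEval figure above are reported as pass@k. The sketch below implements the unbiased estimator given in Chen et al. (2021), 1 − C(n−c, k) / C(n, k), for n samples of which c pass the unit tests. The per-problem results are made-up numbers for illustration and are not taken from any Gemini evaluation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-problem results: (samples drawn, samples passing the tests).
results = [(10, 10), (10, 3), (10, 0), (10, 7)]
pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"pass@1 = {pass_at_1:.2f}")
```

For k = 1 the estimator reduces to the fraction of correct samples per problem, averaged over problems, so a single headline percentage hides how many samples were drawn and how they were selected.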


Training & Scaling (How Gemini Was Built)

  1. PaLM: Scaling Language Modeling with Pathways (Chowdhery et al., 2022)
    https://arxiv.org/abs/2204.02311
    Google’s 540B-parameter language model, trained with the Pathways system for distributed training across TPU pods. The Gemini report states that Gemini was trained using JAX and Pathways.

  2. Training Compute-Optimal Large Language Models (Hoffmann et al., 2022)
    https://arxiv.org/abs/2203.15556
    Chinchilla scaling laws. Explains compute-optimal model sizing, which helps in reasoning about how the Gemini sizes were chosen (most of their parameter counts are undisclosed).

  3. Scaling Laws for Neural Language Models (Kaplan et al., 2020)
    https://arxiv.org/abs/2001.08361
    The original OpenAI scaling laws that informed GPT-3. Relevant for understanding how a model family like Gemini (Ultra, Pro, Nano) is sized.
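
The sizing logic behind the scaling-law entries above can be made concrete with a rough calculation. The sketch assumes two widely used approximations: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and the Chinchilla rule of thumb of about 20 training tokens per parameter. Both constants and the function name `compute_optimal` are illustrative assumptions; Gemini’s actual compute budgets are not public.

```python
from math import sqrt
from typing import Tuple

TOKENS_PER_PARAM = 20.0       # rough Chinchilla rule of thumb (assumption)
FLOPS_PER_PARAM_TOKEN = 6.0   # common approximation: C ~= 6 * N * D


def compute_optimal(flops: float) -> Tuple[float, float]:
    """Return (parameters, training tokens) for a given training FLOP budget.

    Solves C = 6 * N * D together with D = 20 * N for N and D.
    """
    n_params = sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


params, train_tokens = compute_optimal(5.88e23)
print(f"{params / 1e9:.0f}B parameters, {train_tokens / 1e12:.1f}T tokens")
```

A budget of 5.88e23 FLOPs gives roughly 70B parameters and 1.4T tokens, which matches the configuration Hoffmann et al. report for Chinchilla itself.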


Data & Contamination (Understanding the Concerns)

  1. Do Prompt-Based Models Really Understand the Meaning of Their Prompts? (Webson & Pavlick, 2021)
    https://arxiv.org/abs/2109.01247
    Early work on prompt sensitivity. Relevant to understanding why reported benchmark scores depend on the exact evaluation setup, a concern closely tied to contamination.

  2. Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI (Pushkarna et al., 2022)
    https://arxiv.org/abs/2204.01075
    Framework for documenting dataset provenance. Gemini’s training data is largely undisclosed; this paper shows why transparency matters.


On-Device & Efficient Models (Gemini Nano Direction)

  1. MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices (Sun et al., 2020)
    https://arxiv.org/abs/2004.02984
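    How to compress models for phones using knowledge transfer from a larger teacher. Gemini Nano is likewise distilled from larger Gemini models, then quantized to 4 bits for on-device use.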

  2. TinyLlama: An Open-Source Small Language Model (Zhang et al., 2024)
    https://arxiv.org/abs/2401.02385
    Recent small model. Compare with Gemini Nano approaches.


What Came Next (Alternatives & Successors)

  1. Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Gu & Dao, 2023)
    https://arxiv.org/abs/2312.00752
    The next paper in this series. Released within days of Gemini, it explores an alternative to the Transformer.

  2. Jamba: A Hybrid Transformer-Mamba Language Model (Lieber et al., 2024)
    https://arxiv.org/abs/2403.19887
    Hybrid approach: combines Mamba (linear-time) with Transformer (full attention) layers. Targets the same long-context efficiency questions that Gemini raises.

  3. The Llama 3 Herd of Models (Meta, 2024)
    https://arxiv.org/abs/2407.21783
    Meta’s multimodal push post-Gemini. Shows industry-wide shift toward multimodality.


Practical Guides (Using Gemini)

  1. Google AI Studio: Quick Start (Official, 2024)
    https://ai.google.dev/gemini-api/docs/quickstart
    Official guide to calling Gemini API. Hands-on.

  2. Vertex AI Gemini API (Official, 2024)
    https://cloud.google.com/docs/gemini/vision-overview
    Enterprise guide. For production use at scale.

  3. LangChain Gemini Integration (Official, 2024)
    https://js.langchain.com/docs/integrations/llms/google_genai
    How to use Gemini in LangChain (popular Python/JS framework for building AI apps).


Blog Posts & Commentary

  1. Introducing Gemini: Our Largest and Most Capable AI Model (Google Official Blog, December 2023)
    https://blog.google/technology/ai/google-gemini-ai/
    Official announcement. Marketing framing, but covers key points.

  2. Why Gemini’s MMLU Score Needs Scrutiny (Open Philanthropy Analysis, 2024)
    https://www.openphilanthropy.org/research/ai-benchmarks/
    Critical analysis of benchmark claims. Important for understanding limitations.


Broader Context: The AI Race in 2023–2024

  1. The Bitter Lesson (Rich Sutton, 2019)
    http://www.incompleteideas.net/IncIdeas/BitterLesson.html
    Meta-lesson about AI research: general methods that scale with compute tend to beat hand-built domain knowledge. Offers one lens on why a scale-first approach like Gemini’s works.

  2. Superintelligence: Paths, Dangers, Strategies (Nick Bostrom, 2014)
    https://www.amazon.com/Superintelligence-Paths-Dangers-Strategies-Bostrom/dp/0199678871
    Philosophical context: what does it mean when AI models exceed human expertise on benchmarks?


Competing & Open Models (Points of Comparison)

  1. Mistral 7B (Jiang et al., 2023)
    https://arxiv.org/abs/2310.06825
    Efficient small model. Shows a different scaling approach from Gemini’s three-size strategy.

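  2. Claude 3 Model Card (Anthropic, 2024)
    https://www.anthropic.com/news/claude-3-family
    Competing multimodal model family from Anthropic. Published by Anthropic rather than on arXiv; the announcement page links the model card. Compare training approaches.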

  3. OLMo: Accelerating the Science of Language Models (Groeneveld et al., 2024)
    https://arxiv.org/abs/2402.00838
    Fully open, reproducible language model. Contrast with Gemini’s closed training process.


Beginner (2-3 hours):

  1. Read this Gemini paper summary
  2. Watch: “What is a Transformer?” (3Blue1Brown or similar)
  3. Read Paper 8 (Attention Is All You Need)

Intermediate (4-5 hours):

  4. Read Paper 5 (Vision Transformers)
  5. Read the Gemini 1.5 technical report
  6. Run the provided Python code against the Gemini API

Advanced (6+ hours):

  7. Read the full Gemini technical report (arXiv link above)
  8. Read competing papers (GPT-4V, Claude 3, LLaVA)
  9. Understand scaling laws (Hoffmann et al. and Kaplan et al., listed above)
  10. Read the Mamba paper (Paper 21, next in this series)


Paper: Gemini: A Family of Highly Capable Multimodal Models
Previous: Paper 19 (Ring Attention)
Next: Paper 21 (Mamba)
Math Tutorial: Eigenvalues & Eigenvectors