Limitations: What Gemini Can’t Do (Yet)

The technical report claims native multimodality and state-of-the-art performance, but the reality is more nuanced. Here are the real limitations:

1. Sparse Technical Disclosure

The Gemini paper is a technical report, not a peer-reviewed research paper. This means:

Architecture details are vague: The paper doesn’t fully explain how efficient attention works or how multimodal training data is constructed
Training data not disclosed: No breakdown of text vs. image vs. audio vs. video in the training mix
Hyperparameters hidden: Learning rate, batch size, training duration — all proprietary
Reproducibility impossible: No open-source implementation or pre-trained weights (unlike Meta’s LLaMA)

Consequence: Researchers can’t verify claims or build on Gemini’s work easily. This is different from papers like “Attention Is All You Need” (Transformer), which disclosed enough to let thousands of teams reproduce it.

2. Benchmark Data Contamination Concerns

Gemini Ultra’s claim of 90.04% on MMLU (exceeding human experts at 89.8%) was later questioned:

MMLU is a public benchmark with answers available online
If Gemini’s training data scraped the web broadly, it might have seen MMLU questions
Even a small amount of contamination could inflate scores
Google later re-ran the benchmark more carefully, but full details weren’t published
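
One common way to probe for this kind of contamination is to look for long word n-grams shared between benchmark questions and training documents. The sketch below is illustrative only: the function names, the 8-word window, and the toy data are assumptions, not the procedure Google used.

```python
import re

def ngrams(text, n=8):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_items, corpus_docs, n=8):
    """Fraction of benchmark items that share at least one n-gram with the corpus."""
    if not benchmark_items:
        return 0.0, []
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = [item for item in benchmark_items if ngrams(item, n) & corpus_grams]
    return len(flagged) / len(benchmark_items), flagged

# Toy data (hypothetical): the first question appears verbatim in a scraped page.
questions = [
    "Which of the following organelles is responsible for producing most of the cell's ATP?",
    "What is the time complexity of binary search on a sorted array of n elements?",
]
scraped = [
    "Quiz dump: Which of the following organelles is responsible for producing "
    "most of the cell's ATP? Answer: mitochondria",
]

rate, flagged = contamination_rate(questions, scraped)
print(f"{rate:.0%} of items flagged")
```

A check like this only catches near-verbatim leaks. Paraphrased or translated copies of a question would pass unnoticed, which is one reason contamination claims are hard to settle from the outside.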

Consequence: The “first model to beat humans” headline is less solid than initially claimed.

3. Context Length: Trailing Competition

Gemini Ultra’s initial release: 32K tokens (on par with GPT-4’s 32K variant and double GPT-3.5 Turbo’s 16K, but well short of GPT-4 Turbo’s 128K).

To put this in perspective:

A typical article: ~2K tokens
A short book: ~100K tokens
A full meeting transcript: ~50K tokens

With 32K context, Gemini can’t:

Process a full novel
Summarize a full week of meeting transcripts
Reason over a large codebase
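
The budget arithmetic behind these limits can be sketched in a few lines. The document sizes are the rough figures quoted above, context windows are rounded to the nearest thousand, and the 2K-token reserve for the prompt and the reply is an assumption made for illustration.

```python
import math

# Approximate context windows in tokens (rounded figures from the text).
CONTEXT_WINDOWS = {
    "Gemini Ultra (initial)": 32_000,
    "GPT-4 Turbo": 128_000,
    "Gemini 1.5": 1_000_000,
}

# Rough document sizes in tokens (from the text).
DOC_TOKENS = {
    "typical article": 2_000,
    "full meeting transcript": 50_000,
    "short book": 100_000,
}

def chunks_needed(doc_tokens, window, reserve=2_000):
    """How many separate calls a document needs if each call keeps
    `reserve` tokens free for instructions and the model's reply."""
    usable = window - reserve
    if usable <= 0:
        raise ValueError("reserve must be smaller than the context window")
    return math.ceil(doc_tokens / usable)

for doc, tokens in DOC_TOKENS.items():
    for model, window in CONTEXT_WINDOWS.items():
        n = chunks_needed(tokens, window)
        verdict = "fits in one call" if n == 1 else f"needs {n} chunks"
        print(f"{doc:<24} {model:<24} {verdict}")
```

Splitting a document into chunks works for summarisation, but it breaks any reasoning that depends on facts in two different chunks, which is why the window size matters in practice.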

Good news: Gemini 1.5 (released 2024) extended this to 1 million tokens, well beyond GPT-4 Turbo’s 128K. But at the initial announcement, this was a notable gap.

4. Weak Nano Model

Gemini Nano (two sizes, 1.8B and 3.25B parameters) is designed for on-device use. But:

Accuracy drops significantly compared to Pro and Ultra
Struggles with complex reasoning (math, code, nuance)
Language understanding is basic (single-turn conversations work better than multi-turn)

For a student using Gemini Nano on their phone to help with homework:

Easy tasks (summarize a paragraph): ✓ Works well
Hard tasks (explain a concept in depth): ✗ Often too simple

Nano is useful for quick tasks, but not a substitute for Pro or Ultra.

5. The Multimodal Claim Isn’t Fully Verified

The paper claims “trained end-to-end on multimodal data,” but:

Were all three model sizes (Ultra, Pro, Nano) trained jointly on multimodal data?
Or was multimodal training only for Ultra, with Pro and Nano derived later?
How much of the training data was actually multimodal vs. text-only?

The paper doesn’t clarify, and researchers have speculated that the “multimodal” framing might be marketing. Gemini may have significant text-only pre-training phases.

6. Delayed Ultra Release (Initially)

The announcement happened December 2023, but:

Gemini Ultra was announced but not immediately available to the public
Users had access to Gemini Pro (via Google Bard)
Gemini Ultra arrived in February 2024 as Gemini Advanced, through the Google One AI Premium subscription
This meant people couldn’t test the headline claims for about two months

Consequence: The benchmark claims (90.04% MMLU) couldn’t be independently verified immediately. Trust eroded a bit.

7. Struggling with “In-Context Recall”

A known weakness: retrieving specific information from earlier in a long context.

Example:

Context: "The capital of France is Paris. 
          The capital of Germany is Berlin.
          [1000 more sentences about other topics]
          What is the capital of France?"

GPT-4: ✓ Correctly answers "Paris"
Gemini: Sometimes ✗ Answers something incorrect or "I don't see this info"

This is especially problematic for:

Multi-document summarisation
Long meeting transcripts with specific facts to recall
Legal document review requiring exact quote retrieval

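In principle, full attention lets a Transformer such as GPT-4 look directly at any earlier token, which helps with this kind of lookup, although recall can still degrade over very long contexts. Linear-time models such as Mamba (see next paper) face a similar challenge.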
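
A recall test of this kind is easy to automate: hide one fact at different depths in filler text and check whether the answer comes back. The harness below is a minimal sketch. `ask` stands for any function that takes a prompt and returns a string, and `truncating_model` is a toy stand-in (not a real LLM, and not a model of Gemini's behaviour) that only reads the end of the prompt, so the harness can run without an API key.

```python
FILLER = "The weather report mentioned light rain in the afternoon."

def build_prompt(needle, question, n_filler, depth, filler=FILLER):
    """Hide `needle` among n_filler filler sentences.
    depth 0.0 puts it at the start of the context, 1.0 at the end."""
    sentences = [filler] * n_filler
    sentences.insert(round(depth * n_filler), needle)
    return " ".join(sentences) + "\n\nQuestion: " + question

def recall_by_depth(ask, needle, question, expected, depths, n_filler=1000):
    """Return {depth: True/False} for whether the expected answer appears in the reply."""
    results = {}
    for depth in depths:
        prompt = build_prompt(needle, question, n_filler, depth)
        results[depth] = expected.lower() in ask(prompt).lower()
    return results

def truncating_model(prompt, window_chars=20_000):
    """Toy stand-in: only 'sees' the last window_chars characters of the prompt."""
    visible = prompt[-window_chars:]
    if "capital of France is Paris" in visible:
        return "Paris"
    return "I don't see this info"

results = recall_by_depth(
    truncating_model,
    needle="The capital of France is Paris.",
    question="What is the capital of France?",
    expected="Paris",
    depths=[0.0, 0.5, 1.0],
)
print(results)
```

To test a real model, pass a function that calls its API in place of `truncating_model` and plot the results across depths and context lengths.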

8. No Native Video Processing in Initial Release

The paper mentions video, but:

The public API doesn’t accept videos (only text and images)
Internal systems may process video, but it’s not user-accessible
For video understanding, you still have to:
1. Extract key frames
2. Send each frame separately
3. Reassemble the context manually
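
The key-frame step above can be sketched without any video library by computing which frame indices to pull. The function names and the even-spacing strategy are assumptions for illustration. Real pipelines often use scene-change detection instead, and decoding the frames themselves needs a tool such as ffmpeg or OpenCV.

```python
def sample_frame_indices(total_frames, max_frames):
    """Pick up to max_frames evenly spaced frame indices,
    each taken from the middle of its segment of the video."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if max_frames >= total_frames:
        return list(range(total_frames))
    step = total_frames / max_frames
    return [int(step * i + step / 2) for i in range(max_frames)]

def frame_timestamps(indices, fps):
    """Convert frame indices to timestamps in seconds, so each image
    can be labelled when it is sent to the model."""
    return [round(i / fps, 2) for i in indices]

# A 10-second clip at 30 fps, reduced to 5 frames.
indices = sample_frame_indices(total_frames=300, max_frames=5)
times = frame_timestamps(indices, fps=30)
for index, t in zip(indices, times):
    print(f"frame {index} at {t}s")
```

Sending the timestamp alongside each frame is one way to handle the manual reassembly step, since the model otherwise has no notion of the order or spacing of the images.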

Consequence: Not truly “natively multimodal” for end users — audio and video are missing.

9. Compute Cost and Environmental Impact

Training Gemini:

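Used massive TPU clusters (TPU v4 and v5e; the report describes TPU v4 SuperPods of 4,096 chips spread across multiple data centers)
Training time: Not disclosed; outside estimates suggest weeks of continuous training
Energy cost: Not disclosed; outside estimates run to millions of kWh
Carbon footprint: Not disclosed; by rough outside estimates, equivalent to flying round-trip from India to Europe thousands of times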

The paper doesn’t discuss this. Frontier AI models increasingly face scrutiny for environmental impact.

10. No Fine-Tuning API (Initially)

Users couldn’t fine-tune Gemini on their own data in the early releases. You could:

✓ Use the API as-is
✗ Fine-tune on a specialized dataset
✗ Adapt it for your language or domain

This limits use cases like:

Medical AI (Gemini trained on general data, not medical literature)
Legal AI (not trained on case law)
Scientific research (not specialized for your field)

Later releases (Google AI Studio, Tuning API) added this, but not initially.

The Bigger Picture

Gemini’s limitations don’t make it a “bad” model: it’s still state-of-the-art on many benchmarks and useful in practice. But they highlight the gap between:

Marketing claims (“exceeds human experts”)
Research reality (strong on some benchmarks, weaker on others)
User experience (Pro is good, Nano is limited, Ultra is expensive)

This is the nature of frontier AI: every model is a trade-off, and no model is best at everything.
