Limitations: What Gemini Can’t Do (Yet)
The technical report claims native multimodality and state-of-the-art performance, but the reality is more nuanced. Here are the real limitations:
1. Sparse Technical Disclosure
The Gemini paper is a technical report, not a peer-reviewed research paper. This means:
- Architecture details are vague: The paper doesn’t fully explain how efficient attention works or how multimodal training data is constructed
- Training data not disclosed: No breakdown of text vs. image vs. audio vs. video in the training mix
- Hyperparameters hidden: Learning rate, batch size, training duration — all proprietary
- Reproducibility impossible: No open-source implementation or pre-trained weights (unlike Meta’s LLaMA)
Consequence: Researchers can’t verify claims or build on Gemini’s work easily. This is different from papers like “Attention Is All You Need” (Transformer), which disclosed enough to let thousands of teams reproduce it.
2. Benchmark Data Contamination Concerns
Gemini Ultra’s claim of 90.04% on MMLU (exceeding human experts at 89.8%) was later questioned:
- MMLU is a public benchmark with answers available online
- If Gemini’s training data scraped the web broadly, it might have seen MMLU questions
- Even a small amount of contamination could inflate scores
- Google later re-ran the benchmark more carefully, but full details weren’t published
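Consequence: The “first model to beat humans” headline is less solid than initially claimed. The 90.04% figure also relied on chain-of-thought prompting with 32 samples (CoT@32) rather than the standard 5-shot setup, which makes direct comparison with other models harder.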
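One common way to probe for this kind of contamination is to look for long word sequences shared between benchmark items and training documents. The sketch below is a minimal illustration of that idea, not the procedure Google used; the function names, the 8-word window, and the toy data are all assumptions made for the example.

```python
import re

def ngrams(text, n=8):
    """Return the set of word-level n-grams in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_items, corpus_docs, n=8):
    """Fraction of benchmark items that share at least one n-gram with the corpus."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = [item for item in benchmark_items if ngrams(item, n) & corpus_grams]
    return len(flagged) / len(benchmark_items), flagged

# Toy data, invented for illustration only (not real MMLU items or training data).
benchmark = [
    "Which organelle is primarily responsible for producing ATP in eukaryotic cells?",
    "What is the derivative of x squared with respect to x?",
]
corpus = [
    "Quiz dump: which organelle is primarily responsible for producing ATP "
    "in eukaryotic cells? Answer: mitochondria.",
    "A blog post about gardening and tomatoes that has nothing to do with any exam.",
]

rate, flagged = contamination_rate(benchmark, corpus)
print(f"{rate:.0%} of items flagged")
```

Exact n-gram matching only catches verbatim leaks. Paraphrased or translated copies of a question would slip through, which is part of why contamination is hard to rule out from the outside.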
3. Context Length: Trailing Competition
Gemini Ultra’s initial release: 32K tokens (on par with GPT-4’s 32K variant, but well short of GPT-4 Turbo’s 128K).
To put this in perspective:
- A typical article: ~2K tokens
- A short book: ~100K tokens
- A full meeting transcript: ~50K tokens
With 32K context, Gemini can’t:
- Process a full novel
- Summarize a full week of meeting transcripts
- Reason over a large codebase
Good news: Gemini 1.5 (announced in February 2024) extended this to as much as 1 million tokens, which more than closed the gap. But at the initial announcement, this was a notable limitation.
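The arithmetic behind these comparisons can be sketched in a few lines. The token counts below are the rough figures from the list above, the window sizes treat "K" as 1,000, and `reserve_for_output` is a hypothetical allowance for the model's reply; none of these are official numbers.

```python
# Rough token counts taken from the list above; real counts depend on the tokenizer.
DOC_TOKENS = {
    "typical article": 2_000,
    "meeting transcript": 50_000,
    "short book": 100_000,
}
# Window sizes, treating "K" as 1,000 for simplicity (an approximation).
WINDOWS = {"32K": 32_000, "128K": 128_000, "1M": 1_000_000}

def fits(doc_tokens, window_tokens, reserve_for_output=1_000):
    """True if the document plus a reserved output budget fits in the window."""
    return doc_tokens + reserve_for_output <= window_tokens

def chunks_needed(doc_tokens, window_tokens, reserve_for_output=1_000):
    """Number of pieces the document must be split into to fit the window."""
    usable = window_tokens - reserve_for_output
    return -(-doc_tokens // usable)  # ceiling division

for name, tokens in DOC_TOKENS.items():
    for label, window in WINDOWS.items():
        if fits(tokens, window):
            status = "fits"
        else:
            status = f"needs {chunks_needed(tokens, window)} chunks"
        print(f"{name:>18} in {label:>4}: {status}")
```

Splitting a document into chunks is the usual workaround for a small window, but it means the model never sees the whole text at once, so cross-chunk reasoning has to be stitched together by the caller.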
4. Weak Nano Model
Gemini Nano (1.8B and 3.25B parameters, per the technical report) is designed for on-device use. But:
- Accuracy drops significantly compared to Pro and Ultra
- Struggles with complex reasoning (math, code, nuance)
- Language understanding is basic (single-turn conversations work better than multi-turn)
For a student using Gemini Nano on their phone to help with homework:
- Easy tasks (summarize a paragraph): ✓ Works well
- Hard tasks (explain a concept in depth): ✗ Often too simple
Nano is useful for quick tasks, but not a substitute for Pro or Ultra.
5. The Multimodal Claim Isn’t Fully Verified
The paper claims “trained end-to-end on multimodal data,” but:
- Were all three model sizes (Ultra, Pro, Nano) trained jointly on multimodal data?
- Or was multimodal training only for Ultra, with Pro and Nano derived later?
- How much of the training data was actually multimodal vs. text-only?
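The paper says the Nano models were distilled from larger Gemini models, but otherwise leaves these questions open. Some researchers have speculated that the “multimodal” framing is partly marketing and that Gemini may include significant text-only pre-training phases.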
6. Delayed Ultra Release (Initially)
The announcement happened December 2023, but:
- Gemini Ultra was announced but not immediately available to the public
- Users had access to Gemini Pro (via Google Bard)
- Gemini Ultra arrived in February 2024 via Gemini Advanced, part of the Google One AI Premium subscription
- This meant people couldn’t test the headline claims for roughly two months
Consequence: The benchmark claims (90.04% MMLU) couldn’t be independently verified immediately. Trust eroded a bit.
7. Struggling with “In-Context Recall”
A known weakness: retrieving specific information from earlier in a long context.
Illustrative example (a hypothetical outcome, not a measured result):
Context: "The capital of France is Paris.
The capital of Germany is Berlin.
[1000 more sentences about other topics]
What is the capital of France?"
GPT-4: ✓ Correctly answers "Paris"
Gemini: Sometimes ✗ Answers something incorrect or "I don't see this info"
This is especially problematic for:
- Multi-document summarisation
- Long meeting transcripts with specific facts to recall
- Legal document review requiring exact quote retrieval
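Full attention lets a Transformer look back at any earlier token in principle, but that alone does not guarantee reliable recall in practice. Linear-time models such as Mamba (see next paper) face a related challenge, because they compress the context into a fixed-size state.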
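The recall probe in the example above is often called a "needle in a haystack" test, and it is straightforward to build one yourself. The sketch below is a minimal version under stated assumptions: `fake_model` is a hypothetical stand-in that returns a canned answer, and you would replace it with a call to whichever model you want to test. The filler sentences and the substring-based scoring are also simplifications.

```python
import random

def build_haystack_prompt(needle, question, filler_sentences, n_filler, depth, seed=0):
    """Place needle at a relative depth (0.0 = start, 1.0 = end) among filler sentences."""
    rng = random.Random(seed)
    filler = [rng.choice(filler_sentences) for _ in range(n_filler)]
    position = round(depth * n_filler)
    sentences = filler[:position] + [needle] + filler[position:]
    return " ".join(sentences) + "\n\n" + question

def recalled(model_answer, expected):
    """Crude scoring: does the expected string appear in the answer?"""
    return expected.lower() in model_answer.lower()

def fake_model(prompt):
    """Hypothetical stand-in for a real model call; always returns a canned answer."""
    return "The capital of France is Paris."

needle = "The capital of France is Paris."
question = "What is the capital of France?"
filler = ["The weather was mild that week.", "The committee met on Tuesday."]

for depth in (0.0, 0.5, 1.0):
    prompt = build_haystack_prompt(needle, question, filler, n_filler=1000, depth=depth)
    print(depth, recalled(fake_model(prompt), "Paris"))
```

Sweeping both the depth and the amount of filler gives a grid of pass/fail results, which is a more informative picture than a single anecdote about one model getting one question wrong.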
8. No Native Video Processing in Initial Release
The paper mentions video, but:
- The public API doesn’t accept videos (only text and images)
- Internal systems may process video, but it’s not user-accessible
- For video understanding, you still have to:
  - Extract key frames
  - Send each frame separately
  - Reassemble the context manually
Consequence: Not truly “natively multimodal” for end users, since audio and video input are missing from the public API.
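The manual workaround described above (extract frames, send them one by one, reassemble) can be sketched as follows. This only covers choosing which frames to sample and tagging each request with a timestamp so the order can be recovered. Decoding the actual frames would need a video library, and the request dictionary shape is a hypothetical placeholder, not the format of any real API.

```python
def keyframe_indices(total_frames, fps, seconds_between_frames=1.0, max_frames=16):
    """Pick evenly spaced frame indices, capped at max_frames."""
    if total_frames <= 0 or fps <= 0:
        return []
    step = max(1, round(fps * seconds_between_frames))
    indices = list(range(0, total_frames, step))
    if len(indices) > max_frames:
        # Thin the candidates down to max_frames picks spread across the clip.
        stride = len(indices) / max_frames
        indices = [indices[int(i * stride)] for i in range(max_frames)]
    return indices

def build_frame_requests(indices, fps, question):
    """Pair each frame with a timestamped prompt (hypothetical request shape)."""
    return [
        {
            "frame_index": i,
            "timestamp_s": round(i / fps, 2),
            "prompt": f"[t={i / fps:.1f}s] {question}",
        }
        for i in indices
    ]

# A 10-second clip at 30 fps, sampled once per second.
indices = keyframe_indices(total_frames=300, fps=30)
for request in build_frame_requests(indices, fps=30, question="What is happening here?"):
    print(request["frame_index"], request["prompt"])
```

The cap on `max_frames` reflects the practical constraint that each frame costs tokens and API calls. Sampling this way also discards motion and audio entirely, which is exactly what native video support is meant to avoid.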
9. Compute Cost and Environmental Impact
Training Gemini:
- Used large TPU clusters (TPU v4 and v5e; the report describes TPU v4 SuperPods of 4,096 chips each)
- Training time: Not disclosed; outside estimates suggest weeks of continuous training
- Energy cost: Not disclosed; outside estimates run to millions of kWh
- Carbon footprint: By rough analogy, comparable to thousands of round-trip flights between India and Europe
The paper doesn’t discuss this. Frontier AI models increasingly face scrutiny for environmental impact.
10. No Fine-Tuning API (Initially)
Users couldn’t fine-tune Gemini on their own data in the early releases. You could:
- ✓ Use the API as-is
- ✗ Fine-tune on a specialized dataset
- ✗ Adapt it for your language or domain
This limits use cases like:
- Medical AI (Gemini trained on general data, not medical literature)
- Legal AI (not trained on case law)
- Scientific research (not specialized for your field)
Later releases (Google AI Studio, Tuning API) added this, but not initially.
The Bigger Picture
Gemini’s limitations don’t make it a “bad” model. It is still state-of-the-art on many benchmarks and useful in practice. But they highlight the gap between:
- Marketing claims (“exceeds human experts”)
- Research reality (strong on some benchmarks, weaker on others)
- User experience (Pro is good, Nano is limited, Ultra is expensive)
This is the nature of frontier AI: every model is a trade-off, and no model is best at everything.