Mistral 7B
Mistral 7B
Mistral 7B is a 7-billion-parameter language model that outperforms LLaMA 2 13B on most benchmarks while being nearly half the size. Its genius lies not in scaling up — it scales down intelligently.
The two key innovations are Grouped Query Attention (GQA) and Sliding Window Attention (SWA). Together, they slash memory requirements and inference time while maintaining, even improving, quality. GQA reduces the KV cache memory footprint, and SWA makes long-context processing practical by attending only to the most recent tokens instead of everything that came before.
When Mistral released Mistral 7B under the Apache 2.0 license (free for commercial use), it became the go-to foundation for open-source applications — and proved that architectural efficiency innovations matter as much as raw scale.
What this paper did
The core story: A 7B model that:
- Outperforms LLaMA 2 13B on common sense reasoning (MMLU: 60.97% vs 54.16%), math reasoning (GSM8k: 52.16% vs 39.45%), and most other benchmarks
- Matches LLaMA 2 34B on code generation and reasoning tasks
- Uses 4× less KV cache memory during inference than standard Multi-Head Attention
- Implements sliding window attention (only attends to the last 4096 tokens) — reducing attention complexity from O(n²) to O(n·W)
- Was trained on 15 trillion tokens (similar to LLaMA 2 70B)
- Released with full weights under Apache 2.0 — open for commercial use
Key equations (informal, expanded in detail sections):
GQA Memory Saving:
Standard MHA KV cache = 2 × n_heads × d_head × seq_len
GQA KV cache = 2 × n_kv_heads × d_head × seq_len
Mistral 7B: reduction factor = n_heads / n_kv_heads = 32 / 8 = 4×
Sliding Window Attention Complexity:
Standard: O(n² × d) — all pairs
SWA: O(n × W × d) — only recent W tokens
For n = 32000, W = 4096: ~8× speedup per layer
Effective Receptive Field:
With k layers and window size W:
receptive field ≈ k × W tokens
Mistral 7B: 32 layers × 4096 = 131,072 token context
The Indian analogy
For Grouped Query Attention: Imagine a classroom where each student normally has their own personal tutor (that’s Multi-Head Attention). This is expensive — you need as many tutors as students.
Now imagine a variant where all students share one tutor (Multi-Query Attention). That’s cheap but the tutor is stretched thin.
GQA is the middle ground: students are grouped in pairs or small groups, and each group shares one tutor. Student 1 and Student 2 share Tutor A, Student 3 and Student 4 share Tutor B. Each group gets personalised attention without the cost of individual tutors. This is Mistral’s magic.
For Sliding Window Attention: Reading a very long government report, but your working memory only holds the last 10 pages at a time. You don’t need perfect recall of page 1 when reading page 200 — the relevant context usually comes from nearby pages. Historical background and fine details are embedded in the intermediate sections.
SWA formalises this: each token focuses only on the last W tokens in the context. Information from much earlier pages reaches you indirectly through multiple layers — it’s like reading summaries-of-summaries. By layer 32, the “view” extends far beyond W tokens.
Comparison: Mistral 7B vs Peers
| Metric | Mistral 7B | LLaMA 2 13B | LLaMA 2 34B |
|---|---|---|---|
| Parameters | 7B | 13B | 34B |
| Training tokens | 15T | 2T | 2T |
| MMLU | 60.97% | 54.16% | 63.16% |
| GSM8k (math) | 52.16% | 39.45% | 50.74% |
| KV cache (seq 8k, full attn) | 16.8 GB | 52.4 GB | 137.3 GB |
| Inference speed | Fast | ~1.5× slower | ~4× slower |
| License | Apache 2.0 | Llama 2 Community | Llama 2 Community |
Read in this order
| Section | What you will learn | Difficulty | Time |
|---|---|---|---|
| 01 Context | Why efficiency matters; the KV cache problem | 🟢 Beginner | 8 min |
| 02 The Problem | Limitations of Multi-Head Attention and naive long-context | 🟡 Intermediate | 10 min |
| 03 The Idea | GQA intuition + SWA intuition; why they work | 🟡 Intermediate | 12 min |
| 04 The Math | Formal definitions, numerical examples, receptive field calc | 🔴 Advanced | 12 min |
| 05 Worked Example | Step-by-step trace on small input; verify by hand | 🟡 Intermediate | 8 min |
| 06 The Code | Python implementation of GQA; run in Colab | 🟢 Beginner | 5 min |
| 07 Limitations | Real constraints: window size, training cost, quality tradeoff | 🟡 Intermediate | 6 min |
| 08 Impact | What changed: commercial adoption, follow-up papers | 🟢 Beginner | 8 min |
| 09 Summary | One-sentence recap; what to read next | 🟢 Beginner | 2 min |
Before you read: Math tutorials you need
- Scaled Dot-Product Attention — how standard attention computes scores
- Multi-Head Attention — what GQA is an improvement over
- Attention Complexity and KV Cache — why memory is the real bottleneck
- Softmax and Temperature — background for attention weight computation
Architecture Overview
┌─────────────────────────────────────────────────────┐
│ Mistral 7B Architecture │
├─────────────────────────────────────────────────────┤
│ │
│ Input Sequence: [token₁, token₂, ..., tokenₙ] │
│ ↓ │
│ Embedding Layer (4096 dim) │
│ ↓ │
│ ┌─────────────────────────────────────┐ │
│ │ Transformer Block (32 layers) │ │
│ │ • Grouped Query Attention (GQA) │ │
│ │ 32 Q heads, 8 KV heads │ │
│ │ KV shared across 4 Q heads │ │
│ │ Window size W = 4096 │ │
│ │ • Feed-forward MLP (14B params) │ │
│ │ • RMSNorm, Rotary Embeddings │ │
│ │ • SiLU activation │ │
│ └─────────────────────────────────────┘ │
│ ↓ │
│ Output Logits (vocab 32k) │
│ ↓ │
│ Probability Distribution → Next Token │
│ │
│ Total Parameters: 7B │
│ KV Cache per token (seq=8k): 2 MB (vs 8 MB MHA) │
│ │
└─────────────────────────────────────────────────────┘
Discussion
Questions about this paper? Spotted something unclear? Start a discussion below — powered by GitHub, no separate account needed.