Paper 18
Intermediate

Mistral 7B

Mistral 7B is a 7-billion-parameter language model that outperforms LLaMA 2 13B on every benchmark evaluated in the paper while being roughly half the size. Its strength lies not in scaling up but in scaling down intelligently.

The two key innovations are Grouped Query Attention (GQA) and Sliding Window Attention (SWA). Together, they slash memory requirements and inference time while maintaining, even improving, quality. GQA reduces the KV cache memory footprint, and SWA makes long-context processing practical by attending only to the most recent tokens instead of everything that came before.

When Mistral AI released Mistral 7B under the Apache 2.0 license (free for commercial use), it quickly became a popular foundation for open-source applications and showed that architectural efficiency can matter as much as raw scale.

What this paper did

The core story: A 7B model that:

  • Outperforms LLaMA 2 13B on aggregated knowledge (MMLU: 60.97% vs 54.16%), math reasoning (GSM8k: 52.16% vs 39.45%), and the other benchmarks evaluated in the paper
  • Outperforms LLaMA 1 34B, the best released 34B model at the time, on reasoning, mathematics, and code generation
  • Uses 4× less KV cache memory during inference than standard Multi-Head Attention
  • Implements sliding window attention (only attends to the last 4096 tokens) — reducing attention complexity from O(n²) to O(n·W)
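  • Was trained on an undisclosed amount of data: the paper does not report a training token count (for reference, the LLaMA 2 models were trained on 2 trillion tokens)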
  • Released with full weights under Apache 2.0 — open for commercial use

Key equations (informal, expanded in detail sections):

GQA Memory Saving:
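  Standard MHA KV cache = 2 × n_heads × d_head × seq_len elements per layer
  GQA KV cache = 2 × n_kv_heads × d_head × seq_len elements per layer
  (the leading 2 counts keys and values; multiply by n_layers and bytes per element for the total size)
  Mistral 7B: reduction factor = n_heads / n_kv_heads = 32 / 8 = 4×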
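The saving can be checked with a few lines of arithmetic. The sketch below is an illustration rather than code from the paper: `kv_cache_bytes` is a hypothetical helper, and the 32 layers, head dimension of 128, fp16 storage (2 bytes per element), and batch size of 1 are assumptions layered on top of the per-layer formula above.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, d_head: int, seq_len: int,
                   bytes_per_elem: int = 2, batch: int = 1) -> int:
    """Total KV cache size in bytes for one forward context.

    The leading 2 accounts for storing one key tensor and one value tensor per layer.
    """
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_elem * batch


# Assumed Mistral-7B-like shapes: 32 layers, head dim 128, 8k-token context, fp16.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, seq_len=8192)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, d_head=128, seq_len=8192)

print(mha / 2**30, gqa / 2**30, mha // gqa)  # 4.0 1.0 4
```

Under these assumptions the cache shrinks from 4 GiB to 1 GiB at 8,192 tokens, which is exactly the n_heads / n_kv_heads factor.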

Sliding Window Attention Complexity:
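  Standard: O(n² × d), every pair of tokens
  SWA: O(n × W × d), only the most recent W tokens per query
  For n = 32000, W = 4096: n / W ≈ 7.8× fewer query-key scores per layer (an upper bound on the attention speedup, not a measured wall-clock figure)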
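The n / W ratio treats full attention as n² scores. A decoder uses causal masking, so full attention computes only about n² / 2 scores, and the real gap is smaller. The following sketch counts both cases exactly; the function names are hypothetical, and it assumes each token attends to at most the last W positions including itself.

```python
def full_causal_scores(n: int) -> int:
    """Query-key scores under full causal attention: token i sees i + 1 keys."""
    return n * (n + 1) // 2


def sliding_window_scores(n: int, w: int) -> int:
    """Same count when each token sees at most the last w tokens, itself included."""
    if n <= w:
        return full_causal_scores(n)
    # The first w tokens see a growing prefix; every later token sees exactly w keys.
    return full_causal_scores(w) + (n - w) * w


n, w = 32_000, 4_096
full = full_causal_scores(n)
swa = sliding_window_scores(n, w)

print(f"non-causal estimate n / W: {n / w:.2f}x")
print(f"causal count ratio:        {full / swa:.2f}x")
```

For these values the non-causal estimate is about 7.8× and the causal count ratio is about 4.2×, so the headline figure is best read as an order-of-magnitude guide.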

Effective Receptive Field:
  With k layers and window size W:
  receptive field ≈ k × W tokens
  Mistral 7B: 32 layers × 4096 = 131,072 tokens of theoretical attention span (the model itself was trained with an 8,192-token context length)
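The k × W estimate can be verified on a toy model by tracking which input positions are able to influence the last token after each layer. This is a minimal sketch, not the paper's code: `receptive_field` is a hypothetical helper, and it assumes a window that covers W positions including the current one, so each layer extends the reach by W − 1 and the exact count is k × (W − 1) + 1, which is close to k × W for large W.

```python
import numpy as np


def receptive_field(seq_len: int, window: int, n_layers: int) -> int:
    """Number of input positions that can influence the last token after n_layers."""
    idx = np.arange(seq_len)
    # One layer: position i reads positions j with i - window < j <= i.
    one_layer = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    reach = np.eye(seq_len, dtype=bool)  # before any layer, a token only sees itself
    for _ in range(n_layers):
        # i reaches j if i reads some position m that already reaches j.
        reach = (one_layer.astype(np.int64) @ reach.astype(np.int64)) > 0
    return int(reach[-1].sum())


toy = receptive_field(seq_len=64, window=4, n_layers=5)
print(toy)  # 16, i.e. 5 * (4 - 1) + 1
```

Plugging in 32 layers and W = 4096 under the same convention gives 32 × 4095 + 1 = 131,041 positions, in line with the roughly 131K figure above.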

The Indian analogy

For Grouped Query Attention: Imagine a classroom where each student normally has their own personal tutor (that’s Multi-Head Attention). This is expensive — you need as many tutors as students.

Now imagine a variant where all students share one tutor (Multi-Query Attention). That’s cheap but the tutor is stretched thin.

GQA is the middle ground: students are grouped in pairs or small groups, and each group shares one tutor. Student 1 and Student 2 share Tutor A, Student 3 and Student 4 share Tutor B. Each group gets personalised attention without the cost of individual tutors. Here the students are query heads and the tutors are key/value heads: Mistral 7B has 32 query heads sharing 8 key/value heads, so each group holds four students.
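In code, sharing a tutor amounts to reusing one key/value head for several query heads. The NumPy sketch below illustrates the idea under stated assumptions: `grouped_query_attention` is a hypothetical function, consecutive query heads are assumed to form a group, and the small shapes in the usage lines are chosen only to keep the example fast. It is not the reference implementation.

```python
import numpy as np


def grouped_query_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Causal attention where several query heads share one key/value head.

    q has shape (n_heads, seq, d_head); k and v have shape (n_kv_heads, seq, d_head).
    """
    n_heads, seq, d_head = q.shape
    n_kv_heads = k.shape[0]
    assert n_heads % n_kv_heads == 0, "query heads must divide evenly into groups"
    group = n_heads // n_kv_heads
    # Expand each KV head so that `group` consecutive query heads read the same K and V.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    causal = np.tril(np.ones((seq, seq), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
q = rng.standard_normal((32, 6, 16))  # 32 query heads
k = rng.standard_normal((8, 6, 16))   # 8 shared key heads
v = rng.standard_normal((8, 6, 16))   # 8 shared value heads
out = grouped_query_attention(q, k, v)
print(out.shape)  # (32, 6, 16)
```

Only the 8 key/value heads would need to be cached during generation; the `np.repeat` expansion is a convenience for the matrix multiply and does not change what has to be stored.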

For Sliding Window Attention: Imagine reading a very long government report while your working memory holds only the last 10 pages. You don’t need perfect recall of page 1 when reading page 200, because the relevant context usually comes from nearby pages, and the earlier background has already been absorbed into the sections in between.

SWA formalises this: each token focuses only on the last W tokens in the context. Information from much earlier pages reaches you indirectly through multiple layers — it’s like reading summaries-of-summaries. By layer 32, the “view” extends far beyond W tokens.
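The window rule is easiest to see as an attention mask. The sketch below builds one with NumPy; `sliding_window_mask` is a hypothetical helper, and it assumes the window counts the current token as one of the W visible positions.

```python
import numpy as np


def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where entry [i, j] is True if token i may attend to token j."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    # Causal (j <= i) and within the last `window` positions (j > i - window).
    return (j <= i) & (j > i - window)


mask = sliding_window_mask(seq_len=6, window=3)
print(mask.astype(int))
# [[1 0 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 0 0 0]
#  [0 1 1 1 0 0]
#  [0 0 1 1 1 0]
#  [0 0 0 1 1 1]]
```

Each row holds at most W ones, which is where the O(n × W) cost comes from, and the band sliding to the right is why older tokens can only be reached indirectly through earlier layers.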

Comparison: Mistral 7B vs Peers

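| Metric | Mistral 7B | LLaMA 2 13B | LLaMA 1 34B |
|---|---|---|---|
| Parameters | 7B | 13B | 34B |
| Training tokens | Not disclosed | 2T | 1.4T |
| MMLU | 60.97% | 54.16% | 63.16% |
| GSM8k (math) | 52.16% | 39.45% | 50.74% |
| KV cache (8k tokens, fp16, batch 1, full attention) | 1.0 GiB | 6.25 GiB | 12.2 GiB |
| Inference speed | Fast | ~1.5× slower | ~4× slower |
| License | Apache 2.0 | Llama 2 Community | Non-commercial research |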

Read in this order

| Section | What you will learn | Difficulty | Time |
|---|---|---|---|
| 01 Context | Why efficiency matters; the KV cache problem | 🟢 Beginner | 8 min |
| 02 The Problem | Limitations of Multi-Head Attention and naive long-context | 🟡 Intermediate | 10 min |
| 03 The Idea | GQA intuition + SWA intuition; why they work | 🟡 Intermediate | 12 min |
| 04 The Math | Formal definitions, numerical examples, receptive field calc | 🔴 Advanced | 12 min |
| 05 Worked Example | Step-by-step trace on small input; verify by hand | 🟡 Intermediate | 8 min |
| 06 The Code | Python implementation of GQA; run in Colab | 🟢 Beginner | 5 min |
| 07 Limitations | Real constraints: window size, training cost, quality tradeoff | 🟡 Intermediate | 6 min |
| 08 Impact | What changed: commercial adoption, follow-up papers | 🟢 Beginner | 8 min |
| 09 Summary | One-sentence recap; what to read next | 🟢 Beginner | 2 min |

Architecture Overview

┌─────────────────────────────────────────────────────┐
│           Mistral 7B Architecture                   │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Input Sequence: [token₁, token₂, ..., tokenₙ]    │
│         ↓                                           │
│  Embedding Layer (4096 dim)                         │
│         ↓                                           │
│  ┌─────────────────────────────────────┐            │
│  │ Transformer Block (32 layers)       │            │
│  │  • Grouped Query Attention (GQA)    │            │
│  │    32 Q heads, 8 KV heads           │            │
│  │    KV shared across 4 Q heads       │            │
│  │    Window size W = 4096             │            │
│  │  • Feed-forward MLP (hidden 14336)  │            │
│  │  • RMSNorm, Rotary Embeddings       │            │
│  │  • SiLU activation                  │            │
│  └─────────────────────────────────────┘            │
│         ↓                                           │
│  Output Logits (vocab 32k)                          │
│         ↓                                           │
│  Probability Distribution → Next Token              │
│                                                     │
│  Total Parameters: 7B                              │
│  KV cache/token (fp16): 128 KiB (vs 512 KiB MHA)    │
│                                                     │
└─────────────────────────────────────────────────────┘
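The figures in the diagram can be tied together by counting parameters. The sketch below is a back-of-the-envelope check, not an official specification: `MistralConfig` and `count_params` are hypothetical names, the head dimension of 128 and feed-forward hidden size of 14336 are taken from the paper's architecture table, and the count assumes no bias terms, a gated MLP with three weight matrices, and separate input and output embedding matrices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MistralConfig:
    dim: int = 4096          # embedding width
    n_layers: int = 32
    n_heads: int = 32        # query heads
    n_kv_heads: int = 8      # shared key/value heads
    head_dim: int = 128      # assumed from the paper's architecture table
    hidden_dim: int = 14336  # assumed feed-forward hidden size
    vocab_size: int = 32000
    window: int = 4096


def count_params(c: MistralConfig) -> int:
    """Approximate parameter count, assuming no biases and untied embeddings."""
    attn = c.dim * c.n_heads * c.head_dim           # query projection
    attn += 2 * c.dim * c.n_kv_heads * c.head_dim   # key and value projections
    attn += c.n_heads * c.head_dim * c.dim          # output projection
    mlp = 3 * c.dim * c.hidden_dim                  # gate, up, and down projections
    norms = 2 * c.dim                               # two RMSNorm weight vectors per block
    per_layer = attn + mlp + norms
    embeddings = 2 * c.vocab_size * c.dim           # input embedding plus output head
    return c.n_layers * per_layer + embeddings + c.dim  # plus the final RMSNorm


total = count_params(MistralConfig())
print(f"{total:,}")  # 7,241,732,096
```

Under these assumptions the total comes to about 7.24 billion, with the feed-forward layers accounting for most of it and the smaller key/value projections being where GQA trims the attention block.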
