Mistral 7B

Mistral 7B is a 7-billion-parameter language model that outperforms LLaMA 2 13B on most benchmarks while being nearly half the size. Its genius lies not in scaling up — it scales down intelligently.

The two key architectural choices are Grouped Query Attention (GQA) and Sliding Window Attention (SWA). Together, they cut memory use and inference latency without sacrificing quality. GQA shrinks the KV cache by letting several query heads share one key/value head, and SWA makes long-context processing practical by having each token attend only to the most recent W tokens instead of everything that came before.

When Mistral released Mistral 7B under the Apache 2.0 license (free for commercial use), it became the go-to foundation for open-source applications — and proved that architectural efficiency innovations matter as much as raw scale.

What this paper did

The core story: a 7B model that

Outperforms LLaMA 2 13B on MMLU (60.1% vs 55.6%), math reasoning (GSM8K: 52.2% vs 34.3%), and most other benchmarks reported in the paper
Outperforms Llama 1 34B, the 34B LLaMA model the paper compares against, on reasoning, mathematics, and code generation
Uses 4× less KV cache memory during inference than standard Multi-Head Attention
Implements sliding window attention (only attends to the last 4096 tokens) — reducing attention complexity from O(n²) to O(n·W)
Was trained on an undisclosed number of tokens (the paper does not report the size or composition of the training data)
Released with full weights under Apache 2.0 — open for commercial use

Key equations (informal, expanded in detail sections):

GQA Memory Saving:
  Standard MHA KV cache (values per layer) = 2 × n_heads × d_head × seq_len
  GQA KV cache (values per layer) = 2 × n_kv_heads × d_head × seq_len
  Total bytes = values per layer × n_layers × bytes per value
  Mistral 7B: reduction factor = n_heads / n_kv_heads = 32 / 8 = 4×

Sliding Window Attention Complexity:
  Standard: O(n² × d) — all pairs
  SWA: O(n × W × d) — only recent W tokens
  For n = 32000, W = 4096: n / W ≈ 7.8× fewer attention-score computations per layer

Effective Receptive Field:
  With k layers and window size W:
  receptive field ≈ k × W tokens
  Mistral 7B: 32 layers × 4096 = 131,072 tokens of theoretical attention span (an upper bound on reach, not the trained context length)
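
The three informal equations above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are made up for this page, `d_head = 128` is assumed from 4096 / 32, and 2 bytes per value assumes fp16 or bf16 storage. It also counts exact causal query-key pairs, which gives a smaller ratio than the asymptotic n / W figure because a causal mask already skips about half of the full n × n grid.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Leading 2: one K tensor and one V tensor in every layer.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value


def attention_pairs(n, window=None):
    """Count (query, key) pairs scored under a causal mask.

    Assumption: with a window, each query sees itself plus at most
    window - 1 earlier tokens.
    """
    if window is None or window >= n:
        return n * (n + 1) // 2
    # The first `window` queries see 1, 2, ..., window keys; the rest see `window`.
    return window * (window + 1) // 2 + (n - window) * window


def theoretical_span(n_layers, window):
    """Upper bound on how far back information can travel through the stack."""
    return n_layers * window


GIB = 1 << 30
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, seq_len=8192)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, d_head=128, seq_len=8192)
print(f"MHA cache: {mha / GIB:.1f} GiB, GQA cache: {gqa / GIB:.1f} GiB")  # 4.0 vs 1.0

n, W = 32000, 4096
print(f"asymptotic ratio n / W = {n / W:.2f}")
print(f"exact causal pair ratio = {attention_pairs(n) / attention_pairs(n, W):.2f}")
print(f"theoretical span = {theoretical_span(32, W):,} tokens")
```

The cache figures are for a single sequence; multiply by the batch size for a serving workload.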

The Indian analogy

For Grouped Query Attention: Imagine a classroom where each student normally has their own personal tutor (that’s Multi-Head Attention). This is expensive — you need as many tutors as students.

Now imagine a variant where all students share one tutor (Multi-Query Attention). That’s cheap but the tutor is stretched thin.

GQA is the middle ground: students are split into small groups, and each group shares one tutor. Student 1 and Student 2 share Tutor A, Student 3 and Student 4 share Tutor B. Each group gets personalised attention without the cost of individual tutors. In Mistral 7B the students are the 32 query heads and the tutors are the 8 key/value heads, so each tutor serves a group of four.
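
To make the tutor-sharing picture concrete, here is a minimal sketch of grouped query attention in plain Python. It is written for readability rather than speed, it is not the reference implementation, and the names (`gqa_attention`, `rand_tensor`) are hypothetical. The only GQA-specific line is the index mapping `h // group`, which sends each query head to the key/value head it shares.

```python
import math
import random


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gqa_attention(q, k, v):
    """Causal grouped query attention.

    q:    [n_heads][seq_len][d_head]
    k, v: [n_kv_heads][seq_len][d_head], with n_heads divisible by n_kv_heads
    """
    n_heads, n_kv_heads = len(q), len(k)
    assert n_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group = n_heads // n_kv_heads
    d_head = len(q[0][0])
    scale = 1.0 / math.sqrt(d_head)

    out = []
    for h in range(n_heads):
        kv = h // group  # query head h reads the K/V of its group
        head_out = []
        for i, q_i in enumerate(q[h]):
            # Causal mask: position i only scores keys 0..i.
            weights = softmax([dot(q_i, k[kv][j]) * scale for j in range(i + 1)])
            head_out.append([
                sum(w * v[kv][j][c] for j, w in enumerate(weights))
                for c in range(d_head)
            ])
        out.append(head_out)
    return out


def rand_tensor(heads, seq_len, d_head):
    return [[[random.gauss(0, 1) for _ in range(d_head)]
             for _ in range(seq_len)] for _ in range(heads)]


random.seed(0)
q = rand_tensor(4, 3, 2)  # 4 query heads ("students")
k = rand_tensor(2, 3, 2)  # 2 key/value heads ("tutors")
v = rand_tensor(2, 3, 2)
out = gqa_attention(q, k, v)

# Equivalent multi-head attention: copy each K/V head once per query head.
k_rep = [k[h // 2] for h in range(4)]
v_rep = [v[h // 2] for h in range(4)]
mha_out = gqa_attention(q, k_rep, v_rep)
```

Here `out` and `mha_out` agree, which shows that GQA computes the same thing as multi-head attention with tied key/value heads. The saving comes from storing 2 key/value heads in the cache instead of 4.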

For Sliding Window Attention: Imagine reading a very long government report when your working memory holds only the last 10 pages at a time. You don’t need perfect recall of page 1 when reading page 200, because the relevant context usually comes from nearby pages. Whatever still matters from the early pages (historical background, key details) has already been carried forward by the sections in between.

SWA formalises this: each token focuses only on the last W tokens in the context. Information from much earlier pages reaches you indirectly through multiple layers — it’s like reading summaries-of-summaries. By layer 32, the “view” extends far beyond W tokens.
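
The summaries-of-summaries effect can be traced on a toy sequence. The sketch below builds a sliding-window causal mask and follows which input positions can influence the last token after a given number of layers. It is a simplified model under stated assumptions: the function names are hypothetical, and the window is taken to include the current token, so each layer extends the reach by W - 1 positions and the k × W rule of thumb is a slight overestimate.

```python
def sliding_window_mask(n, window):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[(j <= i) and (i - j < window) for j in range(n)] for i in range(n)]


def earliest_reachable(n, window, n_layers):
    """Earliest input position that can influence the last token after n_layers."""
    mask = sliding_window_mask(n, window)
    reachable = {n - 1}  # start from the final token at the top layer
    for _ in range(n_layers):
        # One layer down: everything the current positions attend to.
        reachable = {j for i in reachable for j in range(n) if mask[i][j]}
    return min(reachable)


n, window = 20, 4
for layers in (1, 2, 3, 7):
    first = earliest_reachable(n, window, layers)
    print(f"{layers} layer(s): last token can see back to position {first}")
```

With a window of 4, one layer reaches back 3 positions, two layers reach back 6, and so on until the start of the sequence is covered. The same stacking is what gives Mistral 7B a theoretical span far beyond a single 4096-token window.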

Comparison: Mistral 7B vs Peers

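Metric	Mistral 7B	LLaMA 2 13B	LLaMA 2 34B
Parameters	7B	13B	34B
Training tokens	Not disclosed	2T	2T
MMLU	60.1%	55.6%	62.6%
GSM8K (math)	52.2%	34.3%	42.2%
KV cache (seq 8k, fp16, batch 1, full attn)	1.0 GiB	6.25 GiB	1.5 GiB
Inference speed	Fast	~1.5× slower	~4× slower
License	Apache 2.0	Llama 2 Community	Not publicly released

Note: benchmark scores for Mistral 7B and LLaMA 2 13B are those reported in the Mistral 7B paper; the LLaMA 2 34B scores are from the Llama 2 paper, which uses a different evaluation setup, so treat cross-column comparisons as approximate. KV cache sizes are computed as 2 × n_layers × n_kv_heads × d_head × seq_len × 2 bytes.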

Read in this order

Section	What you will learn	Difficulty	Time
01 Context	Why efficiency matters; the KV cache problem	🟢 Beginner	8 min
02 The Problem	Limitations of Multi-Head Attention and naive long-context	🟡 Intermediate	10 min
03 The Idea	GQA intuition + SWA intuition; why they work	🟡 Intermediate	12 min
04 The Math	Formal definitions, numerical examples, receptive field calc	🔴 Advanced	12 min
05 Worked Example	Step-by-step trace on small input; verify by hand	🟡 Intermediate	8 min
06 The Code	Python implementation of GQA; run in Colab	🟢 Beginner	5 min
07 Limitations	Real constraints: window size, training cost, quality tradeoff	🟡 Intermediate	6 min
08 Impact	What changed: commercial adoption, follow-up papers	🟢 Beginner	8 min
09 Summary	One-sentence recap; what to read next	🟢 Beginner	2 min

Before you read: Math tutorials you need

Scaled Dot-Product Attention — how standard attention computes scores
Multi-Head Attention — what GQA is an improvement over
Attention Complexity and KV Cache — why memory is the real bottleneck
Softmax and Temperature — background for attention weight computation

Architecture Overview

┌─────────────────────────────────────────────────────┐
│           Mistral 7B Architecture                   │
├─────────────────────────────────────────────────────┤
│                                                     │
│  Input Sequence: [token₁, token₂, ..., tokenₙ]    │
│         ↓                                           │
│  Embedding Layer (4096 dim)                         │
│         ↓                                           │
│  ┌─────────────────────────────────────┐            │
│  │ Transformer Block (32 layers)       │            │
│  │  • Grouped Query Attention (GQA)    │            │
│  │    32 Q heads, 8 KV heads           │            │
│  │    KV shared across 4 Q heads       │            │
│  │    Window size W = 4096             │            │
│  │  • Feed-forward MLP (hidden 14336)  │            │
│  │  • RMSNorm, Rotary Embeddings       │            │
│  │  • SiLU activation                  │            │
│  └─────────────────────────────────────┘            │
│         ↓                                           │
│  Output Logits (vocab 32k)                          │
│         ↓                                           │
│  Probability Distribution → Next Token              │
│                                                     │
│  Total Parameters: 7B                              │
│  KV cache at seq=8k, fp16: 1 GiB (vs 4 GiB MHA)   │
│                                                     │
└─────────────────────────────────────────────────────┘
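
As a sanity check on the diagram, the dimensions can be turned into a rough parameter count. This is a back-of-the-envelope sketch, not an official breakdown. It assumes an MLP hidden size of 14336 (taken from the released model configuration rather than from this page), a three-matrix gated MLP, no bias terms, and separate input and output embedding matrices; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MistralConfig:
    dim: int = 4096          # embedding width from the diagram
    n_layers: int = 32
    n_heads: int = 32        # query heads
    n_kv_heads: int = 8      # key/value heads
    hidden_dim: int = 14336  # assumption: MLP hidden size from the released config
    vocab_size: int = 32000  # "32k" in the diagram

    @property
    def head_dim(self) -> int:
        return self.dim // self.n_heads


def count_parameters(cfg: MistralConfig) -> int:
    q_and_o = 2 * cfg.dim * cfg.n_heads * cfg.head_dim      # W_q and W_o
    k_and_v = 2 * cfg.dim * cfg.n_kv_heads * cfg.head_dim   # W_k and W_v (smaller under GQA)
    mlp = 3 * cfg.dim * cfg.hidden_dim                      # gate, up, and down projections
    norms = 2 * cfg.dim                                     # two RMSNorm weight vectors
    per_layer = q_and_o + k_and_v + mlp + norms
    embeddings = 2 * cfg.vocab_size * cfg.dim               # input table + output head (assumed untied)
    final_norm = cfg.dim
    return cfg.n_layers * per_layer + embeddings + final_norm


cfg = MistralConfig()
total = count_parameters(cfg)
print(f"head_dim = {cfg.head_dim}")
print(f"total parameters = {total:,} (about {total / 1e9:.2f}B)")
```

Under these assumptions the total lands a little above 7.2 billion, consistent with the "7B" label, and most of it sits in the feed-forward layers rather than in attention.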
