Context: The KV Cache Problem

In 2023, the world had a problem: language models were getting better, but they were getting expensive. Not just to train — that was a one-time cost — but to run.

The bottleneck wasn’t compute power. GPUs could run matrix multiplications all day. The real bottleneck was memory during inference, both how much of it was needed and how fast it could be read. A major culprit was something called the KV cache.

Why the KV cache exists

When you generate text token-by-token (autoregressive generation), you compute attention at each step:

Step 1: Input = [token₁]
        Compute attention: token₁ attends to [token₁]
        Predict token₂
        
Step 2: Input = [token₁, token₂]
        Compute attention: token₂ attends to [token₁, token₂]
        Predict token₃
        
Step 3: Input = [token₁, token₂, token₃]
        Compute attention: token₃ attends to [token₁, token₂, token₃]
        Predict token₄

The issue: at Step 3, you need the Key and Value vectors from Step 1 and Step 2 — but you already computed them at Steps 1 and 2. Recomputing them would be wasteful.
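To make the waste concrete, here is a minimal sketch that counts how many per-token K/V projections one attention layer performs while generating n tokens, with and without reuse. The function names are illustrative and not taken from any library.

```python
def kv_projections_without_cache(n_tokens: int) -> int:
    """Each step recomputes K and V for the whole prefix: 1 + 2 + ... + n."""
    return sum(range(1, n_tokens + 1))


def kv_projections_with_cache(n_tokens: int) -> int:
    """Each step computes K and V only for the newest token."""
    return n_tokens


if __name__ == "__main__":
    for n in (3, 1024, 8192):
        naive = kv_projections_without_cache(n)
        cached = kv_projections_with_cache(n)
        print(f"n={n}: naive={naive}, cached={cached}, ratio={naive / cached:.1f}x")
```

Without reuse the count grows quadratically with sequence length; with reuse it grows linearly.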

So you cache them: store K and V vectors from all previous steps in GPU memory. Then, when you compute attention for the new token, you:

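Compute Q, K, and V for the new token, and append the new K and V to the cache
Take the dot product of Q with every cached K
Scale the scores and apply softmax to get attention weights
Use those weights to take a weighted sum of the cached V’s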

This is smart: it avoids redundant computation. But it has a cost: you must keep every K and V vector in memory for as long as the sequence is being generated.
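The caching idea can be sketched in a few lines of plain Python. This is a toy, single-head version under stated assumptions: random weights, tiny dimensions, and hypothetical helper names (`KVCache`, `attend`, `attend_from_scratch`) that do not come from any library. It checks that attention over cached K and V gives the same output as recomputing K and V for the whole prefix.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(x, W):
    """Project vector x (length d_in) with W (d_in rows, d_out columns)."""
    return [dot(x, col) for col in zip(*W)]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]


class KVCache:
    """Stores the K and V vectors of every token seen so far (one head)."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x, Wq, Wk, Wv):
        q = matvec(x, Wq)
        self.keys.append(matvec(x, Wk))    # only the new token is projected
        self.values.append(matvec(x, Wv))
        return attend(q, self.keys, self.values)


def attend_from_scratch(xs, Wq, Wk, Wv):
    """Recompute K and V for the whole prefix, then attend from the last token."""
    keys = [matvec(x, Wk) for x in xs]
    values = [matvec(x, Wv) for x in xs]
    return attend(matvec(xs[-1], Wq), keys, values)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    d_model, d_head = 8, 4
    Wq, Wk, Wv = (random_matrix(d_model, d_head, rng) for _ in range(3))
    xs = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(5)]
    cache = KVCache()
    for t, x in enumerate(xs, start=1):
        cached = cache.step(x, Wq, Wk, Wv)
        full = attend_from_scratch(xs[:t], Wq, Wk, Wv)
        assert all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(cached, full))
    print("outputs match; cached tokens:", len(cache.keys))
```

Note that the cache grows by one K and one V vector per generated token, which is exactly the memory cost discussed next.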

The memory explosion

Standard Transformer attention uses Multi-Head Attention (MHA): there are n_heads independent attention heads, and each one has its own K and V cache.

For a model with:

n_heads = 32 (like LLaMA 2 7B)
d_head = 128 (dimension per head)
seq_len = 8192 (context length)

The KV cache size for one layer and one sequence is:

KV_cache = 2 × n_heads × d_head × seq_len
         = 2 × 32 × 128 × 8192
         = 67,108,864 values
         ≈ 256 MB (in float32)
         ≈ 128 MB (in float16)

For a 32-layer model in float16, that’s ~4 GB per sequence just for KV caches. Multiply by a batch size of 32, and you’re looking at 128 GB of memory for a single inference batch, before you even load the model weights or allocate activations.
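The arithmetic above can be reproduced with a small helper. This is a sketch, not a profiler: `kv_cache_bytes` is a hypothetical function, it assumes standard multi-head attention with no sharing between heads, and it reports binary units (1 MiB = 1024² bytes), which is what the MB and GB figures in this section correspond to.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def kv_cache_bytes(n_heads, d_head, seq_len,
                   n_layers=1, batch_size=1, bytes_per_value=2):
    """Bytes needed to cache K and V (the leading factor of 2) under MHA."""
    values = 2 * n_heads * d_head * seq_len * n_layers * batch_size
    return values * bytes_per_value


if __name__ == "__main__":
    cfg = dict(n_heads=32, d_head=128, seq_len=8192)
    print(kv_cache_bytes(**cfg, bytes_per_value=4) / MIB)          # 256.0 (one layer, float32)
    print(kv_cache_bytes(**cfg) / MIB)                             # 128.0 (one layer, float16)
    print(kv_cache_bytes(**cfg, n_layers=32) / GIB)                # 4.0   (32 layers, float16)
    print(kv_cache_bytes(**cfg, n_layers=32, batch_size=32) / GIB)  # 128.0 (batch of 32)
```

Every factor in the product is linear, so doubling the context length, the batch size, or the number of layers doubles the cache.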

This is why:

Longer context = proportionally higher memory cost (the KV cache grows linearly with sequence length)
Batch inference is limited — you can’t fit many samples if you also need room for their KV caches
Mobile inference is nearly impossible: phones rarely have 4 GB of memory to spare for a single sequence’s KV cache

Why efficiency suddenly mattered

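By 2023, everyone was racing to build longer-context models. GPT-4 could handle 32K tokens. Researchers wanted 100K, 1M token contexts. But the 32-layer example above costs about 0.5 MB of float16 cache per token, so without efficiency tricks a 1M-token context would need roughly 512 GB for a single sequence, and a batch of them would reach terabytes.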

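Meanwhile, open-source models like LLaMA faced the same trade-off. LLaMA 2 13B had better quality than 7B, but inference was roughly 2× slower (about twice the weights to read per generated token) and the KV cache was about 1.6× larger per token (40 layers × 40 heads versus 32 × 32, with the same d_head = 128).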

The practical question became: Could you build a 7B model that matches 13B quality without the memory overhead?

The answer was yes — but only if you redesigned attention itself.

The two pieces of the solution

In September 2023, Mistral AI released Mistral 7B, which combined two attention techniques:

Grouped Query Attention (GQA): Instead of each Q head having its own K and V head (as in MHA), several Q heads share a single KV head. Mistral 7B uses 32 Q heads and 8 KV heads, which reduces the KV cache by a factor of 4 (from 256 MB to 64 MB per layer in float32 in the example above).
Sliding Window Attention (SWA): Instead of attending to all previous tokens, each token attends only to the last W = 4096 tokens. This further reduces memory (the KV cache stores at most the 4096 most recent tokens, not all 8192) and computation (attention is O(n × W) instead of O(n²)).
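Both ideas come down to simple index arithmetic, sketched below. The names are hypothetical. The sketch assumes that consecutive Q heads share a KV head (one common convention), that a token's window includes the token itself, and that the windowed cache is kept as a rolling buffer where position i is written to slot i mod W. It stores placeholder strings rather than real tensors.

```python
def kv_head_for_query_head(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """GQA: map a query head to the KV head its group shares."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size


def visible_positions(pos: int, window: int) -> range:
    """SWA: 0-based positions that the token at `pos` attends to (itself included)."""
    return range(max(0, pos - window + 1), pos + 1)


class RollingKVCache:
    """Fixed-size cache: the entry for position i lives in slot i % window."""

    def __init__(self, window: int):
        self.window = window
        self.slots = [None] * window

    def put(self, pos: int, kv):
        # Overwrites the entry that has just fallen out of the window.
        self.slots[pos % self.window] = kv

    def get_visible(self, pos: int):
        return [self.slots[p % self.window]
                for p in visible_positions(pos, self.window)]


if __name__ == "__main__":
    # 32 query heads sharing 8 KV heads: groups of 4.
    print([kv_head_for_query_head(h, 32, 8) for h in (0, 3, 4, 31)])  # [0, 0, 1, 7]

    cache = RollingKVCache(window=4)
    for pos in range(10):
        cache.put(pos, f"kv{pos}")
    print(cache.get_visible(9))  # ['kv6', 'kv7', 'kv8', 'kv9']
    print(len(cache.slots))      # 4
```

The buffer never grows past W entries, however long the sequence gets, which is where the memory saving of the window comes from.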

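Together, in the example above, these two techniques shrink the KV cache by up to 8× (4× from GQA, and another 2× from the window at 8192 tokens), which also speeds up memory-bound decoding. Mistral AI reported that this came with no loss in quality: Mistral 7B outperformed LLaMA 2 13B on the benchmarks they evaluated.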
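As a quick check on the memory side, the sketch below applies both changes to the running example (32 heads, d_head = 128, 8192 tokens, one layer). The function name is illustrative, and the 8 KV heads are an assumption that matches the factor-of-4 grouping described above.

```python
def kv_cache_values_per_layer(n_kv_heads, d_head, seq_len, window=None):
    """Number of cached K and V values for one layer and one sequence."""
    cached_tokens = seq_len if window is None else min(seq_len, window)
    return 2 * n_kv_heads * d_head * cached_tokens


if __name__ == "__main__":
    mha = kv_cache_values_per_layer(32, 128, 8192)
    gqa = kv_cache_values_per_layer(8, 128, 8192)
    gqa_swa = kv_cache_values_per_layer(8, 128, 8192, window=4096)
    print(mha, gqa, gqa_swa)            # 67108864 16777216 8388608
    print(mha // gqa, mha // gqa_swa)   # 4 8
```

The window only helps once the sequence is longer than W; for sequences of 4096 tokens or fewer, the saving in this example comes from GQA alone.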

The rest of the paper explains how they work.