Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Paper by: Albert Gu, Tri Dao
Published: December 2023
Venue: ICLR 2024 (arXiv 2023)
URL: https://arxiv.org/abs/2312.00752
What This Paper Did
Transformers are powerful but slow for long sequences — their attention mechanism is O(n²) in sequence length. Mamba trades off some flexibility to achieve O(n) linear-time performance while matching or exceeding Transformer quality on language modeling.
The key innovation: Selective State Space Models (SSMs). Instead of using fixed state transition matrices (like prior SSMs), Mamba makes the state matrices input-dependent. This allows the model to:
- Remember important information (slow “decay” for key facts)
- Forget irrelevant details (fast “decay” for noise)
- Run extremely efficiently — 5× faster than Transformers for sequences of 2K tokens and beyond
Mamba matches or beats Transformers on:
- Language modeling perplexity (how well it predicts the next word)
- Code generation (HumanEval)
- Mathematical reasoning
But runs significantly faster and uses less memory.
Core Numbers
| Metric | Mamba | Transformer |
|---|---|---|
| Inference Speed (2K seq) | 1× | 5× slower |
| Memory (1M tokens) | Constant O(1) | O(n) (quadratic) |
| Training Compute | Similar | Similar |
| Language Modeling (Chinchilla 7B) | Better | Baseline |
| HumanEval (code) | 57% | 55% |
| Inference Mode | Sequential (O(1) per step) | Parallel (O(n) total) |
The Trade-off
Transformers:
- ✓ Flexible attention (each position attends to any other)
- ✗ O(n²) inference for sequential generation
- ✓ Easy parallelization during training
Mamba:
- ✓ O(n) linear-time inference (massively faster)
- ✗ Less flexible than full attention
- ✗ Harder to parallelize (recurrent structure)
Mamba’s bet: For most real-world tasks, selectivity (focusing on important tokens) beats flexibility (attending to everything).
The Indian Analogy
Imagine a student (the model) processing a long river of information (sequence of tokens).
Traditional Attention (Transformer): The student stops at every word and asks: “What does each previous word tell me about this one?” For a paragraph of 1,000 words, they ask this question 1,000² = 1,000,000 times. Powerful understanding, but exhausting and slow.
State Space Models (Fixed, like S4): The student processes words like a river flowing downstream. Information decays naturally (old information is forgotten). But the rate of decay is the same regardless of content. During monsoon season, information decays fast. During summer, it decays fast too. The student can’t adjust.
Selective State Space Models (Mamba): The student now decides dynamically: “This word is important (teacher said it will be on the exam), so let me remember it longer (slow decay).” vs. “This word is just filler, I’ll forget it quickly (fast decay).” The decay rate depends on the content.
Result: The student understands as well as the attentive approach, but processes information as fast as the fixed river (O(n), not O(n²)).
Read This Paper in This Order
| Section | What You Will Learn | Difficulty | Time |
|---|---|---|---|
| 01 — Context | Why Transformers are slow; prior SSM work (S4); the state space model framework | 🟡 Intermediate | 8 min |
| 02 — The Problem | Fixed SSMs treat all tokens equally; selectivity is key | 🟡 Intermediate | 6 min |
| 03 — The Idea | Selective SSM: make B, C, Δ input-dependent; hardware-aware algorithm | 🟡 Intermediate | 12 min |
| 04 — The Math | SSM equations; discretisation; selective parameters; hardware scan | 🔴 Advanced | 14 min |
| 05 — Worked Example | Complete trace of SSM forward pass with real numbers | 🟡 Intermediate | 10 min |
| 06 — The Code | Minimal Mamba implementation in PyTorch | 🟡 Intermediate | 5 min |
| 07 — Limitations | In-context recall struggles; training overhead; adoption challenges | 🟡 Intermediate | 7 min |
| 08 — Impact | Mamba 2, hybrid models, state space model renaissance | 🟢 Beginner | 5 min |
| 09 — Summary | One-paragraph recap; what came next | 🟢 Beginner | 2 min |
Total reading time: ~50 minutes
Before You Read: Math Tutorials You Need
You should already know these concepts. If not, read them first:
- Eigenvalues and Eigenvectors — State space stability depends on eigenvalues of A
- Matrix Multiplication and Projections — Core of SSM math
- Differential Equations (Brief) — SSMs model continuous-time dynamics (optional, advanced)
Architecture Diagram
Input Sequence u = [u₁, u₂, ..., u_n]
|
↓
Compute selective parameters at each step:
Δₜ = softplus(Wₓ u_t) (input-dependent step size)
B_t = Linear(u_t) (input-dependent input matrix)
C_t = Linear(u_t) (input-dependent output matrix)
|
↓
Discretise state transition matrix:
Ā = exp(Δₜ × A) (depends on current input)
B̄ = Δₜ × B_t
|
↓
Recurrent update (same for all t):
x_t = Ā × x_{t-1} + B̄ × u_t
|
↓
Output:
y_t = C_t × x_t
|
↓
Predicted next token
Discussion
Questions about this paper? Spotted something unclear? Start a discussion below — powered by GitHub, no separate account needed.