The Problem: Fixed State Space Models Are Too Simple
The Fundamental Limitation
Prior SSMs (S4, H3) use fixed A, B, C matrices:
x_t = A x_{t-1} + B u_t ← A and B are the same for all tokens
y_t = C x_t ← C is the same for all tokens
This means:
- The state decay rate (A) is constant — all tokens fade from memory at the same speed
- Input importance (B) is fixed — all tokens influence the state equally
- Output extraction (C) is uniform — all states contribute equally to output
Example: Reading a Research Paper
Imagine processing this sequence:
"The paper proposes [500 words of methodology] and achieves 95% accuracy."
With a fixed SSM:
- Word “proposes”: influence on state = B
- Word “methodology”: influence on state = B (same as “proposes”)
- Word “95%”: influence on state = B (same again)
- Word “accuracy”: influence on state = B (same still)
The model can’t distinguish between:
- Important (key results, novel methods)
- Filler (transitional phrases, background information)
All tokens are equally important to the SSM. But a smart reader (or a Transformer with attention) would focus on results and methods, not filler.
Concrete Failure Cases
1. Factual Recall
Sequence: "Alice lives in Paris.
[1000 words about unrelated topics]
Where does Alice live?"
S4 behavior:
x(0) = 0
x(1) = A·0 + B·"Alice" = input "Alice"
x(2) = A·x(1) + B·"lives" = A·input + input "lives"
x(3) = A·x(2) + B·"in" = A²·input + A·input + input
x(4) = A·x(3) + B·"Paris" = A³·input + A²·input + A·input + input
...
x(1005) = A^1001 · input(0) + ... ≈ 0 (decayed to nearly 0)
x(1006) = decode(x) = "?" (forgot "Alice lives in Paris")
Why? A^1001 is tiny (exponential decay over 1000 steps).
Transformers handle this better because they can attend back to the beginning with full strength.
2. Variable Importance
Sequence: "The theorem states that 2+2=4.
This is obvious and well-known.
Can you prove it?"
S4 behavior:
B = constant, so "theorem" and "obvious" have equal influence.
The model can't learn to prioritize "theorem" over "obvious".
Transformer behavior:
Attention learns: Query "prove" attends strongly to "theorem", weakly to "obvious".
This selectivity improves understanding.
3. Long-Range Dependencies
Sequence: "[Start of long passage]
Mary picked up a red book.
[500 irrelevant words]
She opened it carefully.
What color was it?"
S4 with fixed A = [-0.1]:
After 500 steps, the memory of "red book" (at step 2)
is multiplied by A^498 ≈ 0 (essentially forgotten).
Transformer:
Query "color" attends back to "red" token with full strength.
Correctly predicts "red".
Why Selectivity Matters
Human intelligence relies on selective attention:
- Filtering irrelevance: “The coffee is hot” (fact), vs. “coffee is a beverage” (boring)
- Prioritizing importance: “The capital is Paris” (remember), vs. “Paris is a city” (obvious)
- Remembering key facts: Medical diagnosis depends on specific symptoms, not all symptoms equally
Fixed SSMs can’t do this. They process all tokens identically.
The Trade-off
Fixed SSMs (S4):
- ✓ O(n) time complexity
- ✓ O(1) memory per step
- ✗ Can’t prioritize important information
- ✗ Worse language modeling perplexity than Transformers
Transformers:
- ✓ Full flexibility (attend to any token with any strength)
- ✓ Excellent language modeling and reasoning
- ✗ O(n²) time and memory for attention
- ✗ Slow for long sequences (inference especially)
What if we could have both? Selectivity (smart filtering like humans) AND efficiency (O(n) time)?
That’s what Mamba attempts.
The Missing Piece
The key insight: if B, C, and Δ (discretization step) depend on the input, the SSM can be selective.
Current (Fixed): x_t = A x_{t-1} + B u_t (B is constant)
Proposed (Selective): x_t = Ā x_{t-1} + B_t u_t (B_t depends on u_t)
where B_t = Linear(u_t)
and Ā = exp(Δ_t A) (Δ_t depends on u_t)
Now:
- If u_t is important (e.g., “Paris”), B_t is large → strong influence on state
- If u_t is noise (e.g., “the”), B_t is small → weak influence on state
- The model learns what “important” means from data
But here’s the catch: How do we compute this efficiently?
If Ā changes every step, can we still run in O(n) time?
The answer requires a careful algorithm (Section 3) and hardware optimization.