Paper 09 — Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Shazeer, Mirhoseini, Maziarz, Davis, Le, Hinton, Dean · ICLR 2017 · arXiv:1701.06538

What this paper did

It broke the link between model size and compute cost.

In a standard dense neural network, every parameter is used for every input, so doubling the model size roughly doubles the compute spent on each token. In 2017, this hard coupling made scaling far beyond a few billion parameters prohibitively expensive.

Shazeer’s team inserted a Mixture-of-Experts (MoE) layer between the stacked LSTM layers of their network (later Transformer variants put it in place of the FFN sub-layer instead). The layer holds n expert networks, each a small feed-forward network, plus a learned gating function that routes each token to only k of them. With n=100 experts and k=2, the layer has 100× the expert parameters, but each token pays the compute cost of only 2 experts plus a small gating overhead. The model’s knowledge capacity and its per-token compute cost become largely independent quantities.
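To make the parameters-versus-compute split concrete, here is a small arithmetic sketch. The function name and the layer sizes (d_model=512, d_hidden=2048) are illustrative assumptions, not values from the paper; each expert is assumed to be a bias-free two-matrix FFN and the gate a single linear map.

```python
def moe_ffn_costs(d_model, d_hidden, n_experts, k):
    """Return (total_params, active_params_per_token) for one MoE layer.

    Assumes each expert is a two-matrix FFN (d_model -> d_hidden -> d_model)
    without biases, and a linear gate of shape (d_model, n_experts).
    """
    params_per_expert = 2 * d_model * d_hidden
    gate_params = d_model * n_experts
    total = n_experts * params_per_expert + gate_params
    # Only k experts run for a given token; the gate always runs.
    active = k * params_per_expert + gate_params
    return total, active


# Hypothetical sizes, chosen only for illustration.
total, active = moe_ffn_costs(d_model=512, d_hidden=2048, n_experts=100, k=2)
print(f"total parameters: {total:,}")
print(f"active per token: {active:,}")
print(f"active fraction:  {active / total:.2%}")
```

Under these assumptions, raising n_experts grows the total almost linearly, while the active count changes only through the small gate term.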

The key equations:

G(x) = Softmax( TopK( x · W_g, k ) )    ← sparse routing weights
MoE(x) = Σᵢ G(x)ᵢ · Eᵢ(x)             ← weighted blend of k active experts
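L_balance = α · n · Σᵢ fᵢ · pᵢ         ← auxiliary loss preventing expert collapse

Note: the L_balance form shown here is the simplified loss popularised later by Switch Transformer. The 2017 paper itself penalises the squared coefficient of variation of per-expert importance and load, and it adds tunable Gaussian noise to the gate logits before TopK.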
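A minimal NumPy sketch of the balancing term may help show why it discourages collapse. It assumes fᵢ is the fraction of tokens whose first-choice expert is i and pᵢ is the mean gate probability given to expert i across the batch; the function name and the value of α are hypothetical.

```python
import numpy as np


def balance_loss(router_probs, expert_index, alpha=0.01):
    """Compute alpha * n * sum_i f_i * p_i for one batch of tokens.

    router_probs: (tokens, n) softmax over gate logits for each token.
    expert_index: (tokens,) index of each token's first-choice expert.
    alpha: loss weight (hypothetical value).
    """
    n = router_probs.shape[1]
    # f_i: fraction of tokens dispatched to expert i (assumed definition).
    f = np.bincount(expert_index, minlength=n) / len(expert_index)
    # p_i: mean router probability assigned to expert i (assumed definition).
    p = router_probs.mean(axis=0)
    return alpha * n * np.sum(f * p)


# Perfectly balanced routing: 4 tokens, 4 experts, one token each.
balanced = balance_loss(np.full((4, 4), 0.25), np.array([0, 1, 2, 3]))

# Collapsed routing: every token goes to expert 0 with full confidence.
collapsed_probs = np.tile([1.0, 0.0, 0.0, 0.0], (4, 1))
collapsed = balance_loss(collapsed_probs, np.zeros(4, dtype=int))

print(balanced, collapsed)
```

In this toy setting the collapsed batch scores n times higher than the balanced one, so minimising the term pushes the gate toward spreading tokens across experts.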

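The result: models with up to 137 billion parameters whose per-token compute stays close to that of a far smaller dense model. The paper reports more than 1000× gains in model capacity with only minor losses in computational efficiency. By 2023, MoE layers were in use in Switch Transformer and Mixtral, and GPT-4 was widely reported (though never officially confirmed) to use them.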

The Indian analogy

Picture a government hospital with 1,000 specialists. A triage doctor (the gating network) briefly examines each patient (token) and sends them to the 2 most relevant specialists (the top-k experts). The hospital’s total knowledge is vast, but each patient consults only a small fraction of it. The auxiliary balancing loss is the hospital administrator, making sure no single specialist ends up with a three-year waiting list while others sit idle.

Read in this order

Section	What you will learn	Difficulty	Time
1. Context	The compute wall of 2017, MoE’s 1990s origins	🟢	4 min
2. The Problem	Every neuron firing for every token is wasteful	🟢	3 min
3. The Idea	Hospital analogy, sparse routing, expert specialisation	🟡	5 min
4. The Math	Gating function, TopK, auxiliary loss — worked by hand	🔴	10 min
5. Worked Example	4 experts routing “chai bahut garam hai” token by token	🔴	8 min
6. The Code	Full MoE forward pass and balancing loss in NumPy	🟡	6 min
7. Limitations	Communication overhead, collapse, token dropping	🟡	4 min
8. Impact	Switch Transformer, Mixtral, GPT-4, the frontier	🟢	4 min
9. Summary	One-page recap	🟢	3 min

Also: Glossary · Quiz · Further Reading

Before you read: math tutorials you need

Softmax Function: TopK followed by Softmax produces the sparse gating weights
Cross-Entropy Loss: the main training objective the MoE model minimises
Probability Distributions: gating weights are a sparse probability distribution
Matrix Multiplication: used in gating (x · W_g) and in the expert FFNs

MoE layer at a glance

Input token x (d_model dimensions)
       │
       ▼
  ┌──────────────────────────────────┐
  │  GATING NETWORK                 │
  │  logits = x · W_g               │   (one score per expert)
  │  mask all but top-k to -∞       │
  │  G(x) = Softmax(masked logits)  │   (sparse weights, sum to 1)
  └──────────────────────────────────┘
       │
       ├──── Expert i  (if G(x)ᵢ > 0) → Eᵢ(x) → weight G(x)ᵢ
       ├──── Expert j  (if G(x)ⱼ > 0) → Eⱼ(x) → weight G(x)ⱼ
       └──── All other experts: SKIPPED (G = 0, no compute)
       │
       ▼
  MoE(x) = Σᵢ G(x)ᵢ · Eᵢ(x)   (only k terms non-zero)
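
The steps in the diagram can be sketched for a single token in NumPy. This is a simplified illustration, not the paper's implementation: it omits the noise term the paper adds to the gate logits, and the helper names, layer sizes, and random weights are all assumptions made for the example.

```python
import numpy as np


def make_expert(rng, d_model, d_hidden):
    """Build a toy expert: a two-layer FFN with ReLU (hypothetical sizes)."""
    W1 = rng.normal(size=(d_model, d_hidden)) * 0.1
    W2 = rng.normal(size=(d_hidden, d_model)) * 0.1
    return lambda x: np.maximum(x @ W1, 0.0) @ W2


def moe_forward(x, W_g, experts, k=2):
    """Route one token x through its top-k experts and blend the outputs."""
    logits = x @ W_g                         # one score per expert
    top_k = np.argsort(logits)[-k:]          # indices of the k largest scores
    masked = np.full_like(logits, -np.inf)   # mask all but top-k to -inf
    masked[top_k] = logits[top_k]
    weights = np.exp(masked - masked.max())  # exp(-inf) evaluates to 0
    weights /= weights.sum()                 # sparse weights that sum to 1
    out = np.zeros_like(x)
    for i in top_k:                          # experts outside top-k never run
        out += weights[i] * experts[i](x)
    return out, weights


rng = np.random.default_rng(0)
d_model, d_hidden, n_experts = 8, 16, 4
experts = [make_expert(rng, d_model, d_hidden) for _ in range(n_experts)]
W_g = rng.normal(size=(d_model, n_experts))
x = rng.normal(size=d_model)

y, g = moe_forward(x, W_g, experts, k=2)
print("gate weights:", g)
print("output shape:", y.shape)
```

The loop touches only the k selected experts, which is where the compute saving comes from; a production implementation would batch tokens per expert rather than loop token by token.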
