Further Reading — Paper 09: Mixture of Experts (2017)
The original paper
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., & Dean, J. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR 2017. https://arxiv.org/abs/1701.06538
The paper is dense but readable. Focus on Section 2 (the MoE layer definition, with the gating formulas in Section 2.1) and Figure 1 (the architecture diagram). Section 4 (the balancing losses) is worth reading carefully if you want to understand training stability. Table 1 summarises the headline language-modelling results, and the experiments in Section 5 scale up to models with 137 billion parameters.
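As a warm-up for the gating formulas, here is a minimal sketch of top-k gating in plain Python. It keeps the k largest gate logits, applies a softmax over those, and gives every other expert a gate of exactly zero so it never runs. The noise term that the paper adds to the logits before the top-k step is left out to keep the example deterministic, and the scalar lambda "experts" and the names `top_k_gate` and `moe_layer` are illustrative assumptions, not the paper's code.

```python
import math

def top_k_gate(logits, k):
    """Softmax over the k largest logits; every other gate is exactly 0."""
    # Indices of the k largest logits (ties go to the lower index).
    top = sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

def moe_layer(x, experts, gate_logits, k=2):
    """Gate-weighted sum of expert outputs; experts with a zero gate are skipped."""
    gates = top_k_gate(gate_logits, k)
    return sum(g * expert(x) for g, expert in zip(gates, experts) if g > 0.0)

# Four toy scalar "experts" (hypothetical stand-ins for feed-forward networks).
experts = [lambda x: x, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
gate_logits = [1.0, 3.0, 0.5, 3.0]  # in a real layer these are computed from x

gates = top_k_gate(gate_logits, k=2)
output = moe_layer(3.0, experts, gate_logits, k=2)
print(gates)   # experts 1 and 3 share the weight equally
print(output)
```

Only two of the four experts are evaluated here, which is the whole point of sparse gating: parameter count grows with the number of experts while per-input compute grows only with k.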
The original MoE idea (1991)
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive Mixtures of Local Experts. Neural Computation.
The 26-year-old idea that Shazeer et al. revived. Worth skimming to appreciate how the field rediscovered and scaled a dormant concept. Available through most university library systems.
The direct simplification: Switch Transformer
Fedus, W., Zoph, B., & Shazeer, N. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. https://arxiv.org/abs/2101.03961
Reduces k to 1 (one expert per token), scales to 1.6 trillion parameters, and shows that sparse MoE models can be trained reliably at that scale. Section 2 is one of the clearest explanations of MoE routing in the literature. Highly recommended as a companion to this paper.
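To make the k=1 routing concrete, the sketch below computes the auxiliary load-balancing loss in the form the Switch Transformer paper gives it: alpha times the number of experts times the sum over experts of f_i * P_i, where f_i is the fraction of tokens sent to expert i and P_i is the mean router probability for expert i. The function name, the toy probabilities, and the default `alpha=0.01` are assumptions chosen for illustration.

```python
def switch_aux_loss(router_probs, alpha=0.01):
    """router_probs holds one list of per-expert probabilities for each token."""
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    # Top-1 routing: each token goes to its highest-probability expert.
    assignments = [max(range(num_experts), key=lambda i: p[i]) for p in router_probs]
    # f_i: fraction of tokens dispatched to expert i.
    f = [assignments.count(i) / num_tokens for i in range(num_experts)]
    # P_i: mean router probability given to expert i.
    P = [sum(p[i] for p in router_probs) / num_tokens for i in range(num_experts)]
    return alpha * num_experts * sum(fi * Pi for fi, Pi in zip(f, P))

balanced = [[0.6, 0.4], [0.4, 0.6]]    # one token per expert
collapsed = [[0.9, 0.1], [0.8, 0.2]]   # every token picks expert 0

balanced_loss = switch_aux_loss(balanced)
collapsed_loss = switch_aux_loss(collapsed)
print(balanced_loss, collapsed_loss)
```

With perfectly uniform routing the loss equals alpha; it grows as tokens and probability mass pile onto a few experts, which is what pushes the router back towards balance.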
Open-source MoE at scale
Jiang, A. Q. et al. (2024). Mixtral of Experts. https://arxiv.org/abs/2401.04088
Paper 18 in this curriculum. The first widely deployed open-source MoE model: 8 experts per layer, k=2, and 46.7 billion total parameters, of which 12.9 billion are active for any given token. Read after completing this paper to see how the 2017 ideas are implemented in a modern production model.
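The gap between total and active parameters follows from simple arithmetic. The sketch below assumes every parameter is either shared (attention, embeddings, routers) or belongs to exactly one of the 8 experts, and solves the two resulting equations for the shared and per-expert counts. The split it produces is a back-of-envelope estimate under that assumption, not a figure taken from the Mixtral paper.

```python
total_params = 46.7e9    # shared weights plus all 8 experts
active_params = 12.9e9   # shared weights plus the 2 experts a token uses
num_experts, k = 8, 2

# total  = shared + num_experts * per_expert
# active = shared + k * per_expert
# Subtracting the second equation from the first isolates per_expert.
per_expert = (total_params - active_params) / (num_experts - k)
shared = active_params - k * per_expert
active_fraction = active_params / total_params

print(f"per expert (all layers): {per_expert / 1e9:.2f}B")
print(f"shared:                  {shared / 1e9:.2f}B")
print(f"active fraction:         {active_fraction:.1%}")
```

Under this assumption each expert accounts for roughly 5.6 billion parameters across all layers and the shared weights for roughly 1.6 billion, so a token touches a little over a quarter of the model.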
Deeper dives on routing and balancing
Zoph, B. et al. (2022). ST-MoE: Designing Stable and Transferable Sparse Expert Models. https://arxiv.org/abs/2202.08906
A thorough investigation of what makes MoE training stable. Covers the router z-loss (a stability term that penalises large router logits and is used alongside the auxiliary balancing loss, not in place of it), capacity factor tuning, and expert specialisation analysis. Essential reading before implementing MoE from scratch.
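For orientation before reading ST-MoE, the router z-loss is the mean, over tokens, of the squared log-sum-exp of each token's router logits. A minimal sketch follows; the function name and the toy logits are assumptions, and the coefficient that scales this term in the total loss is omitted.

```python
import math

def router_z_loss(router_logits):
    """Mean over tokens of (log-sum-exp of that token's router logits) squared."""
    total = 0.0
    for logits in router_logits:
        m = max(logits)  # subtract the max so exp() cannot overflow
        lse = m + math.log(sum(math.exp(v - m) for v in logits))
        total += lse ** 2
    return total / len(router_logits)

small = router_z_loss([[0.0, 0.0], [0.5, -0.5]])
large = router_z_loss([[10.0, 10.0], [12.0, 8.0]])
print(small, large)
```

The penalty depends on the magnitude of the logits rather than on which expert wins, so it discourages the router from producing very large values that cause numerical trouble in the softmax.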
Zhou, Y. et al. (2022). Mixture-of-Experts with Expert Choice Routing. https://arxiv.org/abs/2202.09368
Flips the routing direction: instead of tokens choosing experts (token choice), experts choose tokens (expert choice). Because each expert selects a fixed number of tokens, load is balanced by construction and no auxiliary loss is needed. An elegant solution to the balancing problem.
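A small sketch shows why expert choice balances load. Each expert ranks all tokens by router score and takes its top `capacity` tokens, so every expert processes the same number. The function name, the `capacity` parameter, and the toy scores are illustrative assumptions rather than the paper's implementation.

```python
def expert_choice(scores, capacity):
    """scores[t][e] is the router affinity of token t for expert e.

    Returns a dict mapping each expert to the tokens it selected.
    """
    num_tokens, num_experts = len(scores), len(scores[0])
    chosen = {}
    for e in range(num_experts):
        # Rank tokens by this expert's score, highest first (ties to lower index).
        ranked = sorted(range(num_tokens), key=lambda t: (-scores[t][e], t))
        chosen[e] = ranked[:capacity]
    return chosen

# Every token prefers expert 0, so top-1 token choice would overload it.
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
routing = expert_choice(scores, capacity=2)
print(routing)
```

The trade-off is that nothing forces every token to be picked: a token can be selected by several experts or by none, which the paper discusses as a feature that lets compute vary per token.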
Accessible explanation
Lilian Weng: “Mixture of Experts” https://lilianweng.github.io/posts/2024-01-01-moe/
A comprehensive 2024 survey covering the original paper through Mixtral and DeepSeek. Covers routing strategies, load balancing, hardware considerations, and specialisation analysis with diagrams. Best reference article on the topic.
Video
Yannic Kilcher: “Outrageously Large Neural Networks — Paper Explained” Search on YouTube. Kilcher walks through the paper’s equations with commentary on what works and what doesn’t. About 45 minutes. Particularly good on the auxiliary loss motivation.
Next in this curriculum
Paper 10 — GPT-1: Improving Language Understanding by Generative Pre-Training →
The decoder-only Transformer trained on next-token prediction — the direct ancestor of ChatGPT. Pre-dates widespread MoE adoption; uses a dense Transformer. Establishes the pre-train + fine-tune paradigm.
Math tutorials you will need for Paper 10:
- Cross-Entropy Loss → ✅ (just built)
- Probability Distributions → ✅
- Softmax Function → ✅