The paper is dense but readable. Focus on Section 2 (the MoE layer definition, with the gating formulas in Section 2.1) and Figure 1 (architecture diagram). Section 5.2 reports the 137-billion-parameter model results. Section 4 (balancing loss) is worth reading carefully if you want to understand training stability.

The original MoE idea (1991)

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive Mixtures of Local Experts. Neural Computation.

The 26-year-old idea that Shazeer et al. revived. Worth skimming to appreciate how the field rediscovered and scaled a dormant concept. Available through most university library systems.

The direct simplification: Switch Transformer

Fedus, W., Zoph, B., & Shazeer, N. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. https://arxiv.org/abs/2101.03961

Reduces k to 1 (one expert per token), scales to 1.6 trillion parameters, and shows that MoE can be trained reliably at that scale. Section 2 is the clearest explanation of MoE routing in the literature. Highly recommended as a companion to this paper.
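To make the difference between k=1 and k=2 concrete, here is a minimal sketch of token-choice routing in plain Python. The function names (softmax, route_top_k) and the toy logits are illustrative assumptions, not code from either paper. The gate weight used here is simply the softmax probability of the selected expert; the papers differ in how they normalise it.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Token-choice routing: each token keeps its k highest-probability experts.

    router_logits: one list of per-expert logits per token (hypothetical input).
    Returns, per token, a list of (expert_index, gate_weight) pairs.
    """
    assignments = []
    for logits in router_logits:
        probs = softmax(logits)
        ranked = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)
        assignments.append([(e, probs[e]) for e in ranked[:k]])
    return assignments

# Three tokens, four experts (toy values).
logits = [
    [2.0, 0.5, -1.0, 0.1],
    [0.0, 0.3, 1.5, -0.2],
    [1.0, 1.2, 0.0, 0.4],
]

switch_style = route_top_k(logits, k=1)  # one expert per token
top2_style = route_top_k(logits, k=2)    # two experts per token

print([pairs[0][0] for pairs in switch_style])
print([[e for e, _ in pairs] for pairs in top2_style])
```

With k=1 each token's output comes from a single expert, so the dispatch step only has to send each token to one place. That is the simplification the Switch Transformer paper argues for.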

Open-source MoE at scale

Jiang, A. Q. et al. (2024). Mixtral of Experts. https://arxiv.org/abs/2401.04088

Paper 18 in this curriculum. The first widely deployed open-source MoE model: 8 experts per layer, k=2, 46.7 billion total parameters, 12.9 billion active per token. Read after completing this paper to see how the 2017 ideas are implemented in a modern production model.

Deeper dives on routing and balancing

Zoph, B. et al. (2022). ST-MoE: Designing Stable and Transferable Sparse Expert Models. https://arxiv.org/abs/2202.08906

A thorough investigation of what makes MoE training stable. Covers the router z-loss (a stability term added alongside the auxiliary balancing loss that penalises large router logits), capacity factor tuning, and expert specialisation analysis. Essential reading before implementing MoE from scratch.
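As a rough illustration of the router z-loss, the sketch below computes the mean over tokens of the squared log-sum-exp of each token's router logits, which is the form given in the ST-MoE paper. The function names and toy logits are assumptions for illustration. In training this term is scaled by a small coefficient and added to the other losses; that step is omitted here.

```python
import math

def logsumexp(values):
    # Stable log(sum(exp(v))) for one token's router logits.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def router_z_loss(router_logits):
    """Mean over tokens of (log-sum-exp of that token's router logits) squared."""
    return sum(logsumexp(row) ** 2 for row in router_logits) / len(router_logits)

# Toy logits for two tokens and four experts (hypothetical values).
small = [[0.1, -0.2, 0.0, 0.3], [0.0, 0.1, -0.1, 0.2]]
# Same routing preferences, but with logits ten times larger.
large = [[10.0 * v for v in row] for row in small]

print(router_z_loss(small))
print(router_z_loss(large))
```

The larger logits give a larger penalty even though the preferred expert for each token is unchanged. That is the point of the term: it discourages the router from drifting to large-magnitude logits, where numerical issues are more likely.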

Zhou, Y. et al. (2022). Mixture-of-Experts with Expert Choice Routing. https://arxiv.org/abs/2202.09368

Flips the routing direction: instead of tokens choosing experts (token choice), experts choose tokens (expert choice). Automatically balances load without the auxiliary loss. An elegant solution to the balancing problem.
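The reversal is easy to see in a small sketch. Below, each expert ranks all tokens by affinity and takes a fixed number of them. The names (expert_choice_route, capacity) and the toy logits are illustrative assumptions. The paper derives the per-expert capacity from a capacity factor and uses batched top-k operations rather than Python loops.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expert_choice_route(router_logits, capacity):
    """Expert-choice routing: each expert selects its `capacity` best tokens.

    router_logits: one list of per-expert logits per token (hypothetical input).
    Returns a dict mapping expert index to the list of chosen token indices.
    """
    scores = [softmax(row) for row in router_logits]  # token-to-expert affinities
    num_tokens = len(scores)
    num_experts = len(scores[0])
    chosen = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        chosen[e] = ranked[:capacity]
    return chosen

# Four tokens, two experts. Under token choice with k=1, tokens 0, 1 and 2
# would all pick expert 0, leaving expert 1 with a single token.
logits = [
    [2.0, 0.0],
    [1.5, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
]

chosen = expert_choice_route(logits, capacity=2)
print(chosen)
```

Every expert processes exactly `capacity` tokens by construction, so no balancing loss is needed. The trade-off is that a given token may be picked by several experts or by none.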

Accessible explanation

Lilian Weng: “Mixture of Experts” https://lilianweng.github.io/posts/2024-01-01-moe/

A comprehensive 2024 survey covering the original paper through Mixtral and DeepSeek. Covers routing strategies, load balancing, hardware considerations, and specialisation analysis with diagrams. Best reference article on the topic.

Video

Yannic Kilcher: “Outrageously Large Neural Networks — Paper Explained” Search on YouTube. Kilcher walks through the paper’s equations with commentary on what works and what doesn’t. About 45 minutes. Particularly good on the auxiliary loss motivation.

Next in this curriculum

Paper 10 — GPT-1: Improving Language Understanding by Generative Pre-Training →

The decoder-only Transformer trained on next-token prediction — the direct ancestor of ChatGPT. Pre-dates widespread MoE adoption; uses a dense Transformer. Establishes the pre-train + fine-tune paradigm.

Math tutorials you will need for Paper 10:

Further Reading — Paper 09: Mixture of Experts (2017)

Further Reading

The original paper