8. Impact — the architecture behind frontier AI today

The 2017 MoE paper did not immediately change the mainstream. The Transformer paper appeared the same year and received far more attention. MoE sat in the background as a niche research direction — promising but difficult.

Then, starting around 2021, something shifted. As dense models became more expensive to train and serve, MoE became not just interesting but practically important. Today, several of the most capable publicly documented models are MoE-based, and others are widely believed to be.

Google’s Switch Transformer (2021): k=1 and scale

Fedus, Zoph, and Shazeer (the same Shazeer from the 2017 paper) published the Switch Transformer, which simplified MoE radically: use k=1 (route each token to exactly one expert). This “switch” routing eliminates the need to combine two expert outputs, simplifies the gating function, and reduces routing computation and communication. The paper pairs it with training techniques aimed at the instabilities that had made earlier MoE models hard to train.
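As a minimal sketch of what k=1 routing means for a single token, the snippet below picks the expert with the highest router probability and scales that expert's output by the probability. It is an illustration under simplifying assumptions, not the paper's implementation: the toy experts, the logits, and the function names (switch_route, switch_layer) are all hypothetical, and load balancing and capacity limits are left out.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(router_logits):
    """Top-1 routing: return (expert_index, gate_probability) for one token."""
    probs = softmax(router_logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]

def switch_layer(token, router_logits, experts):
    """Run only the selected expert and scale its output by the gate value."""
    idx, gate = switch_route(router_logits)
    return [gate * y for y in experts[idx](token)]

# Hypothetical toy experts: each is a plain function on a list of floats.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
]

token = [1.0, 2.0]
logits = [0.1, 2.0, -1.0]  # made-up router scores for this token

idx, gate = switch_route(logits)
out = switch_layer(token, logits, experts)
print(f"expert {idx} chosen with gate {gate:.3f}, output {out}")
```

With k=2, the layer would instead run two experts and sum their gate-weighted outputs, which is the combination step that switch routing removes.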

The Switch Transformer scaled to 1.6 trillion parameters (with 2,048 experts). The paper reports up to 7× faster pre-training than dense T5-Base and T5-Large models at the same computational budget, and a 4× speedup over the 11-billion-parameter T5-XXL for its largest models. For the first time, MoE was demonstrated to be a reliable path to efficiency, not just an interesting curiosity.

Google’s GLaM (2022): production-scale MoE

GLaM (Generalist Language Model) had 1.2 trillion parameters with 64 experts per layer and k=2. It matched GPT-3’s performance on language benchmarks using roughly one-third of the energy for training. GLaM was among the first public demonstrations of MoE in a system approaching production quality.

GPT-4 (2023): widely believed to be MoE

OpenAI has not officially confirmed GPT-4’s architecture. Unconfirmed reports and third-party analyses suggest GPT-4 is a MoE model. One widely circulated claim describes 8 experts of roughly 220 billion parameters each (about 1.76 trillion in total), with only a fraction of those parameters active per token. If accurate, this means GPT-4 needs far less inference compute per token than a dense model of the same total size, which would help explain how OpenAI can serve it at scale.

Whether or not the specific numbers are right, the broader point holds: at the frontier of language model capabilities in 2023–2025, MoE is not a research experiment — it is likely the dominant architecture.

Mixtral 8×7B (2023, Paper 18): open-source MoE for everyone

Mistral AI published Mixtral 8×7B — an open-source, freely downloadable MoE model with 8 experts per layer, 46.7 billion total parameters, and 12.9 billion active per token (k=2). It outperformed the dense LLaMA-2 70B on most benchmarks while being faster at inference.
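The gap between 46.7 billion total and 12.9 billion active parameters can be unpacked with a little arithmetic. The sketch below assumes a deliberately simple model: every token uses all shared weights (attention, embeddings, and so on) plus k of the n expert stacks, so total = shared + n × expert and active = shared + k × expert. The function name and this two-bucket split are illustrative assumptions, not figures taken from the Mixtral paper.

```python
def split_parameters(total, active, n_experts, k):
    """Solve total = shared + n*e and active = shared + k*e for (shared, e)."""
    per_expert = (total - active) / (n_experts - k)
    shared = active - k * per_expert
    return shared, per_expert

# Published Mixtral 8x7B figures: 46.7B total, 12.9B active, 8 experts, k=2.
shared, per_expert = split_parameters(46.7e9, 12.9e9, n_experts=8, k=2)

print(f"per expert (summed over layers): {per_expert / 1e9:.2f}B")
print(f"shared by every token:           {shared / 1e9:.2f}B")
```

Under these assumptions each expert accounts for about 5.6 billion parameters and about 1.6 billion are shared, which is why “8×7B” totals well under 56 billion: only the expert feed-forward weights are replicated eight times.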

Mixtral was among the first widely used, openly available MoE language models. It put the architecture in the hands of every researcher and developer, not just large labs. Paper 18 in this curriculum covers it in detail.

DeepSeek-V2 and MoE in 2024

Chinese lab DeepSeek released DeepSeek-V2 in 2024, a 236-billion-parameter MoE model with 21 billion active parameters per token. Its authors report performance competitive with leading models of the time, together with 42.5% lower training cost and 5.76× higher maximum generation throughput than their earlier dense 67-billion-parameter model. MoE was a central part of that compute efficiency.

The broader lesson: capacity vs compute

The lasting intellectual contribution of the 2017 paper is a simple but powerful insight: model capacity and inference compute are separable.

Dense models had always coupled these two quantities: doubling parameters meant roughly doubling compute per token. The MoE paper broke this coupling. A MoE model with 10× the parameters of a dense model can have comparable or lower inference compute, if routing is effective.

This insight reframed how the field thinks about scaling. Instead of “how many parameters can we afford to run?” the question became “how many parameters can we afford to store, and how many do we activate per token?” These have very different answers — storage is much cheaper than compute. MoE exploits exactly this gap.
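To make the store-versus-activate distinction concrete, here is a rough back-of-the-envelope sketch. It assumes 2 bytes per parameter (16-bit weights) and the common rule of thumb of about 2 floating-point operations per active parameter per token in a forward pass. Both are approximations, and the two dense models are hypothetical comparison points rather than real systems.

```python
def footprint(total_params, active_params, bytes_per_param=2):
    """Rough estimate: (weight memory in GB, forward GFLOPs per token)."""
    memory_gb = total_params * bytes_per_param / 1e9
    gflops_per_token = 2 * active_params / 1e9  # ~2 FLOPs per active weight
    return memory_gb, gflops_per_token

# (total parameters, active parameters per token)
models = {
    "dense 12.9B (hypothetical)": (12.9e9, 12.9e9),
    "Mixtral 8x7B (MoE)": (46.7e9, 12.9e9),
    "dense 46.7B (hypothetical)": (46.7e9, 46.7e9),
}

results = {name: footprint(total, active) for name, (total, active) in models.items()}

for name, (mem, gflops) in results.items():
    print(f"{name:28s} {mem:6.1f} GB stored   {gflops:6.1f} GFLOPs/token")
```

In this estimate the MoE model pays the memory bill of the larger dense model but the per-token compute bill of the smaller one, which is the gap the paragraph above describes.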

Much of the work on sparse scaling efficiency since 2021 builds on this insight.

By the numbers

2017 MoE paper: 137 billion parameters (10× the largest dense models)
2021 Switch Transformer: 1.6 trillion parameters
2022 GLaM: 1.2 trillion parameters
2023 Mixtral 8×7B: 46.7 billion parameters, 12.9 billion active per token
2024 DeepSeek-V2: 236 billion parameters, 21 billion active per token