1. Context — the compute wall of 2017

In 2017, the AI community was grappling with a tantalising problem: everyone suspected that bigger models were better models, but nobody could afford to find out.

The intuition was clear. More parameters meant more capacity to memorise patterns. More capacity meant better language understanding, better translation, better reasoning. The relationship seemed direct and powerful. But there was a brutal practical constraint: every parameter in a dense neural network is used on every input. If you double the parameters, you roughly double the compute required for every forward pass. If you increase the model by 100×, you need roughly 100× the compute on every single token, at every single training step, and at every inference call.
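The scaling argument above can be put into rough numbers. This is a back-of-the-envelope sketch, assuming the common rule of thumb that a dense forward pass costs about 2 floating-point operations per parameter per token; the function name and the model sizes are illustrative, not taken from any particular system.

```python
def dense_forward_flops(num_params: int, num_tokens: int = 1) -> int:
    """Approximate forward-pass FLOPs for a dense model.

    Assumes ~2 FLOPs (one multiply, one add) per parameter per token.
    This is a rule of thumb that ignores attention and other overheads.
    """
    return 2 * num_params * num_tokens


base = dense_forward_flops(100_000_000)       # hypothetical 100M-parameter model
scaled = dense_forward_flops(10_000_000_000)  # the same design scaled 100x
ratio = scaled // base
print(ratio)  # 100
```

Under this approximation the cost ratio equals the parameter ratio, which is exactly the wall described above: in a dense model there is no way to add capacity without adding per-token compute.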

In 2017, training a model with even a few billion parameters was already at the limits of what was financially and technically feasible. The largest language models of the era had hundreds of millions to low billions of parameters. Google’s seq2seq translation models, Bahdanau’s attention networks, the original Transformer — these were already expensive enough.

Researchers were starting to ask: is there a way to have a very large model — enormous parameter capacity — without paying the full compute cost of that size on every token?

The answer had a name from the 1990s: Mixture of Experts.

The Mixture of Experts idea originated in work by Jacobs, Jordan, Nowlan, and Hinton in 1991. The core concept: instead of one large neural network processing every input, have many specialised sub-networks (experts) and a gating mechanism that decides which experts to rely on for each input. In the sparse form of the idea, most experts are not consulted for any given input, only the most relevant ones, so compute cost scales with the number of active experts, not the total number of experts.
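A minimal sketch of sparse gating may make this concrete. It assumes linear experts, a linear gate, and top-k selection with a softmax over the selected scores; these are illustrative choices, not the exact formulation of the 1991 paper or of any later one, and every name here (`moe_forward`, `gate`, `experts`) is hypothetical.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def moe_forward(x, gate, experts, k=2):
    """Route one input through the top-k of several linear experts.

    x:       input vector of length d
    gate:    n_experts rows of length d; row i scores expert i
    experts: n_experts matrices, each d x d
    Returns the mixed output and the indices of the experts that ran.
    """
    scores = matvec(gate, x)
    # Keep only the k highest-scoring experts.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only, so the mixing weights sum to 1.
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Only the selected experts are evaluated; the others cost no compute.
    out = [0.0] * len(x)
    for w, i in zip(weights, top):
        y = matvec(experts[i], x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top


random.seed(0)
d, n_experts = 4, 8
x = [random.gauss(0, 1) for _ in range(d)]
gate = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_experts)]
experts = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
           for _ in range(n_experts)]

y, used = moe_forward(x, gate, experts, k=2)
print(len(y), len(used))  # 4 2
```

With 8 experts and k=2, only a quarter of the expert parameters are touched for this input. Adding more experts grows the parameter count, while the expert work per input stays fixed at k matrix-vector products plus the gate scores.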

The idea had not scaled well through the 1990s and 2000s. Training many experts together was unstable — a few experts would get all the training signal while others learned nothing. The gating network was hard to train. The architecture did not map onto the hardware of the era.
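One way to put a number on this kind of imbalance, sketched under the assumption that we can count how many inputs each expert receives, is the squared coefficient of variation of the per-expert load. The counts below are made up for illustration, and `cv_squared` is a hypothetical helper name.

```python
def cv_squared(loads):
    """Squared coefficient of variation: population variance / mean^2.

    Zero when every expert receives the same load; grows as load
    concentrates on a few experts.
    """
    mean = sum(loads) / len(loads)
    var = sum((v - mean) ** 2 for v in loads) / len(loads)
    return var / (mean ** 2)


balanced = [25, 25, 25, 25]  # hypothetical inputs routed to each of 4 experts
collapsed = [97, 1, 1, 1]    # one expert receives almost everything

print(cv_squared(balanced))   # 0.0
print(cv_squared(collapsed))
```

A measure of this shape can be added to the training loss as a penalty, which pushes the gate toward spreading inputs more evenly across experts.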

In 2017, Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean at Google Brain revisited this 1990s idea with modern tools: GPU clusters, better optimisers, and a clever fix for training instability. Their sparse layer sat between stacked LSTM layers; the paper predates the Transformer, which appeared later the same year. Titled “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer,” it demonstrated a model with up to 137 billion parameters, more than 10× the largest models of the era, that trained and ran at a cost comparable to much smaller dense models.

The title was a deliberate provocation. Outrageously large. And it worked.