7. Limitations — the real cost of sparsity

MoE’s promise is seductive: enormous model capacity at low compute cost. The reality is more complicated. Getting MoE to work reliably at scale involves solving a cluster of interrelated engineering and research problems that the 2017 paper exposed but did not fully resolve.

1. Communication overhead across machines

In practice, experts are distributed across many machines (GPUs or TPU chips). When a token is routed to Expert 47, that expert might live on a different machine from the one processing the current batch.

This all-to-all communication — sending each token to its assigned expert and receiving the result — is expensive. On modern hardware with high-speed interconnects (NVLink, InfiniBand), communication is fast but not free. For large numbers of experts spread across hundreds of machines, this communication can consume a significant fraction of total training time.
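To get a feel for the scale of this traffic, here is a rough back-of-envelope sketch. It assumes uniform routing across devices and counts only the token activations that are sent out and returned. The function name and every number in the example configuration are hypothetical, chosen for illustration rather than taken from the paper.

```python
def all_to_all_bytes_per_layer(tokens_per_device, d_model, k, n_devices,
                               bytes_per_value=2):
    """Rough bytes one device exchanges for a single MoE layer (forward pass).

    Assumes uniform routing, so a fraction (n_devices - 1) / n_devices of
    token copies leave the device. Ignores padding, compression, and any
    overlap of communication with compute.
    """
    remote_fraction = (n_devices - 1) / n_devices
    token_copies = tokens_per_device * k      # each token is sent to k experts
    one_way = token_copies * remote_fraction * d_model * bytes_per_value
    return 2 * one_way                        # dispatch to experts + results back


# Hypothetical configuration, for illustration only (2 bytes per value = fp16).
traffic = all_to_all_bytes_per_layer(
    tokens_per_device=8192, d_model=4096, k=2, n_devices=64
)
print(f"{traffic / 1e6:.0f} MB per MoE layer per forward pass")  # 264 MB
```

The estimate grows linearly with k and with the hidden size, and it is paid again in the backward pass and for every MoE layer in the model.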

A dense model trained with plain data parallelism has no such per-layer exchange: its forward pass is local to each machine’s batch of data, and machines communicate only to synchronise gradients. This is the hidden tax of MoE that does not appear in the parameter count or theoretical FLOPs.

2. Expert collapse and training instability

The auxiliary balancing loss reduces expert collapse but does not eliminate it. In practice, training MoE models is noticeably more fragile than training dense models:

If α (the balancing loss coefficient) is too small, experts collapse. If it is too large, the auxiliary loss dominates and the model stops learning language.
Some tokens are naturally harder to route than others — semantically ambiguous words like “bank” or “set” may confuse the gating network.
Early in training, random initialisation means some experts receive much more gradient signal than others, creating a feedback loop before the balancing loss kicks in.
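The role of α can be made concrete with a small sketch. It assumes the balancing term takes the squared-coefficient-of-variation form over per-expert "importance" (the sum of gate values each expert receives in a batch). The coefficient is called `alpha` to match the text, and the gate values are toy numbers rather than outputs of a trained model.

```python
def cv_squared(values):
    """Squared coefficient of variation: (population) variance / mean**2."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return variance / (mean ** 2 + 1e-10)     # small epsilon avoids 0 / 0


def balancing_loss(gate_weights, alpha):
    """gate_weights[t][e] is the gate value of token t for expert e."""
    n_experts = len(gate_weights[0])
    # Importance of an expert = sum of its gate values over the batch.
    importance = [sum(row[e] for row in gate_weights) for e in range(n_experts)]
    return alpha * cv_squared(importance)


balanced = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]    # load spread evenly
collapsed = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]   # one expert takes all

loss_balanced = balancing_loss(balanced, alpha=0.01)
loss_collapsed = balancing_loss(collapsed, alpha=0.01)
print(loss_balanced, loss_collapsed)
```

The balanced batch contributes zero penalty, while the collapsed batch is penalised in proportion to `alpha`. The size of that penalty relative to the language-modelling loss is what the coefficient trades off.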

The paper used a carefully tuned combination of noise injection, the auxiliary loss, and a “soft” version of the top-k gating to achieve stability. Later work (Switch Transformer, 2021) showed that k=1 (top-1 routing) is sufficient and simpler than k=2, in part because the gating network no longer has to produce calibrated relative weights between two chosen experts.
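To illustrate what "relative weights between two chosen experts" means, here is a minimal sketch of top-k gating without the noise term: keep the k largest router logits and apply a softmax over just those. The function name and the logits are hypothetical, and this is a simplified illustration rather than the paper's full gating function.

```python
import math


def top_k_gates(logits, k):
    """Keep the k largest logits, softmax over those, zero everywhere else."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    max_logit = max(logits[i] for i in top)            # for numerical stability
    exps = {i: math.exp(logits[i] - max_logit) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]


logits = [1.0, 3.0, 0.5, 2.0]          # hypothetical router outputs for one token
gates_k2 = top_k_gates(logits, k=2)    # experts 1 and 3 share the weight
gates_k1 = top_k_gates(logits, k=1)    # expert 1 receives the full weight
print(gates_k2, gates_k1)
```

Under this formulation, k=1 always yields a weight of exactly 1.0 for the chosen expert, so there is no relative weighting left to calibrate.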

3. Capacity factor and token dropping

As discussed in Section 4, each expert has a fixed capacity per batch. Tokens exceeding that capacity are dropped — they pass through the MoE layer unchanged via the residual connection.

Dropped tokens get no expert processing. If rare or unusual tokens are consistently routed to a popular expert, they are dropped repeatedly, which can degrade quality. Setting the capacity factor higher (2.0 instead of 1.0) reduces dropping but enlarges every expert’s buffer, increasing memory usage and partially negating MoE’s efficiency advantage.
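The trade-off can be sketched with a few lines of arithmetic. The capacity formula below (capacity factor times the perfectly balanced load, rounded up) is one common convention and is an assumption here, as are the function names and the toy routing assignments.

```python
import math
from collections import Counter


def expert_capacity(n_tokens, n_experts, k, capacity_factor):
    """Slots per expert: capacity_factor times the perfectly balanced load."""
    return math.ceil(capacity_factor * n_tokens * k / n_experts)


def count_dropped(assignments, n_experts, capacity):
    """assignments holds one expert id per routed token copy."""
    load = Counter(assignments)
    return sum(max(0, load[e] - capacity) for e in range(n_experts))


# Hypothetical skewed routing: 8 tokens, top-1, expert 0 is popular.
assignments = [0, 0, 0, 0, 0, 1, 2, 3]
for cf in (1.0, 2.0):
    cap = expert_capacity(n_tokens=8, n_experts=4, k=1, capacity_factor=cf)
    print(f"capacity_factor={cf}: capacity={cap}, "
          f"dropped={count_dropped(assignments, 4, cap)}")
```

With a capacity factor of 1.0, each expert has 2 slots and 3 of the 5 tokens sent to expert 0 are dropped. At 2.0, each expert has 4 slots and only 1 is dropped, but every expert's buffer is twice as large, including the three that are mostly empty.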

There is no free lunch: you must tune the capacity factor, k (experts per token), and n (the number of experts) for each model size and hardware configuration.

4. Inference complexity

At inference time, every expert must still be loaded into memory, even if only 2 of 1,000 are active for any given token. A 137-billion-parameter model must fit on enough hardware to hold all experts simultaneously.

For deployment, this means the hardware requirement scales with total parameter count (like a dense model), not with active parameters. A user serving a MoE model pays the storage and memory cost of the full model but gets the inference compute cost of a smaller active portion.
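The gap between what must be stored and what is actually computed can be shown with simple arithmetic. In the sketch below, the split between shared and per-expert parameters is hypothetical, chosen only so that the total lands at 137 billion. The 2 bytes per parameter assumes 16-bit weights, and the model is treated as having a single MoE layer.

```python
def moe_footprint(shared_params, params_per_expert, n_experts, k,
                  bytes_per_param=2):
    """Return (resident_bytes, active_params) for a single-MoE-layer model."""
    total_params = shared_params + n_experts * params_per_expert
    active_params = shared_params + k * params_per_expert
    return total_params * bytes_per_param, active_params


# Hypothetical split, for illustration only.
resident_bytes, active_params = moe_footprint(
    shared_params=1_000_000_000,
    params_per_expert=136_000_000,
    n_experts=1000,
    k=2,
)
print(f"resident weights: {resident_bytes / 1e9:.0f} GB")          # 274 GB
print(f"active per token: {active_params / 1e9:.2f}B parameters")  # 1.27B
```

Under these assumptions, the weights occupy 274 GB of memory while each token touches under 1% of them.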

This made MoE impractical for small-scale deployment in 2017. It remained a “large-cluster research” technique until hardware improved and engineering sophistication caught up, which is why it only became mainstream in products like Mixtral (2023, Paper 18) about six years later.

5. Interpretability is harder

In a dense model, you can study what each FFN neuron responds to. In a MoE model, you must additionally understand the routing decisions: which types of tokens go to which experts, and why. This is a two-level interpretability problem.

Researchers have found that MoE experts do tend to specialise — some handle punctuation, some handle named entities, some handle grammatical function words — but the specialisation is not clean or guaranteed. The gating function is learned end-to-end and can produce routing decisions that are hard to interpret or predict.
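A first step in studying routing is simply tabulating which kinds of tokens each expert receives. The sketch below assumes a log of (token category, chosen expert) pairs is available. The category labels, the log contents, and the function names are all hypothetical and are not results from any real model.

```python
from collections import Counter, defaultdict


def routing_table(routing_log):
    """Map each expert id to a Counter of the token categories it received."""
    table = defaultdict(Counter)
    for category, expert_id in routing_log:
        table[expert_id][category] += 1
    return table


def dominant_share(counter):
    """Fraction of an expert's tokens that fall in its most common category."""
    total = sum(counter.values())
    return counter.most_common(1)[0][1] / total


# Hypothetical log of (token category, chosen expert) pairs.
log = [("punct", 0), ("punct", 0), ("entity", 1), ("function", 0),
       ("entity", 1), ("punct", 2), ("entity", 2)]
table = routing_table(log)
for expert_id in sorted(table):
    share = dominant_share(table[expert_id])
    print(expert_id, dict(table[expert_id]), round(share, 2))
```

A dominant share near 1.0 suggests a cleanly specialised expert. Values closer to an even split, like expert 2 in this toy log, correspond to the messier specialisation described above.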

What the paper got right despite the limitations

The paper established that sparse conditional computation could scale neural networks far beyond what a dense model’s compute budget allows. Most of the limitations above have been partially addressed by subsequent work:

Communication overhead: better hardware interconnects and expert parallelism strategies
Training instability: top-1 routing (Switch Transformer), better initialisations
Token dropping: improved capacity scheduling, expert choice routing (Zhou et al., 2022)
Inference cost: efficient serving systems (DeepSpeed, MegaBlocks)

The problems were real but solvable. The core idea — that model capacity and compute cost can be decoupled through learned routing — was correct and has proven durable.