Token dropping

Appears in 1 paper

When an expert receives more tokens than its capacity allows, excess tokens skip the MoE layer and pass through the residual connection unchanged.

As used in Paper 09 — Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer →

Dropping tokens is a necessary engineering trade-off. Without capacity limits, an overloaded expert becomes a hardware bottleneck, because the device hosting it must finish its larger batch before the layer can complete. With too much dropping, many tokens leave the layer without any expert processing, so their representations carry only what the residual connection passes through.
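
The mechanism can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in this entry: top-1 routing, a capacity of ceil(capacity_factor * num_tokens / num_experts), and first-come-first-served ordering within the batch. The names `expert_capacity`, `apply_moe_with_dropping`, and `capacity_factor` are hypothetical, and scalar "tokens" stand in for hidden vectors.

```python
import math


def expert_capacity(num_tokens, num_experts, capacity_factor=1.0):
    # Assumed convention: an equal share of the batch per expert,
    # scaled by capacity_factor and rounded up.
    return math.ceil(capacity_factor * num_tokens / num_experts)


def apply_moe_with_dropping(tokens, assignments, experts, capacity):
    """Apply top-1 experts with a per-expert capacity limit.

    tokens:      list of floats (stand-ins for hidden vectors)
    assignments: expert index chosen by the router for each token
    experts:     list of callables, one per expert
    Returns (outputs, dropped_indices).
    """
    load = [0] * len(experts)  # tokens accepted so far by each expert
    outputs, dropped = [], []
    for i, (x, e) in enumerate(zip(tokens, assignments)):
        if load[e] < capacity:
            load[e] += 1
            outputs.append(x + experts[e](x))  # residual plus expert output
        else:
            dropped.append(i)
            outputs.append(x)  # dropped: residual path only, unchanged
    return outputs, dropped


# Three of four tokens are routed to expert 0, which can hold only two.
tokens = [1.0, 2.0, 3.0, 4.0]
assignments = [0, 0, 0, 1]
experts = [lambda x: 10 * x, lambda x: -x]

capacity = expert_capacity(len(tokens), len(experts), capacity_factor=1.0)
outputs, dropped = apply_moe_with_dropping(tokens, assignments, experts, capacity)
print(capacity, dropped, outputs)  # 2 [2] [11.0, 22.0, 3.0, 0.0]
```

In this sketch, raising `capacity_factor` to 1.5 gives a capacity of 3 and no token is dropped, at the cost of reserving more compute and memory per expert.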