Token dropping
When an expert receives more tokens than its capacity allows, excess tokens skip the MoE layer and pass through the residual connection unchanged. This is a necessary engineering trade-off: without capacity limits, overloaded experts become hardware bottlenecks, but with too much dropping, many token representations receive no expert processing at that layer.
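
The dropping rule can be illustrated with a minimal sketch in plain Python. It assumes top-1 routing, scalar "tokens" for readability, and first-come-first-served admission to each expert. The names expert_capacity and route_with_capacity are hypothetical, and the capacity formula (capacity factor times tokens per expert, rounded up) is a common convention rather than something stated above.

```python
import math
from typing import Callable, List, Sequence, Tuple


def expert_capacity(num_tokens: int, num_experts: int, capacity_factor: float) -> int:
    # Assumed convention: each expert may take its even share of the
    # tokens, scaled by a capacity factor and rounded up.
    return math.ceil(capacity_factor * num_tokens / num_experts)


def route_with_capacity(
    tokens: Sequence[float],
    expert_ids: Sequence[int],
    experts: Sequence[Callable[[float], float]],
    capacity: int,
) -> Tuple[List[float], List[int]]:
    """Apply a top-1 MoE layer with a per-expert capacity limit.

    Returns the layer outputs and the indices of the dropped tokens.
    """
    load = [0] * len(experts)  # tokens accepted by each expert so far
    outputs: List[float] = []
    dropped: List[int] = []
    for i, (x, e) in enumerate(zip(tokens, expert_ids)):
        if load[e] < capacity:
            load[e] += 1
            outputs.append(x + experts[e](x))  # residual plus expert output
        else:
            dropped.append(i)
            outputs.append(x)  # over capacity: residual path only
    return outputs, dropped


# Toy experts standing in for feed-forward networks.
toy_experts = [lambda x: 10.0 * x, lambda x: -x]
tokens = [1.0, 2.0, 3.0, 4.0]
expert_ids = [0, 0, 0, 1]  # the router sends three tokens to expert 0

capacity = expert_capacity(len(tokens), len(toy_experts), capacity_factor=1.0)
outputs, dropped = route_with_capacity(tokens, expert_ids, toy_experts, capacity)
print(capacity, outputs, dropped)
```

With a capacity of 2, the third token sent to expert 0 is dropped and leaves the layer unchanged, while the other tokens receive their expert's output on top of the residual. Raising capacity_factor reduces dropping at the cost of more padded, potentially idle expert slots.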