Noisy top-k gating
The specific gating formulation from the 2017 paper: Gaussian noise is added to the raw logits before top-k selection during training. The noise encourages exploration: the same token may be routed to different experts across training steps, even when its noise-free logits would always select the same experts. This routing diversity helps every expert receive a training signal.
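
The mechanism can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name, argument names, and shapes are hypothetical, and the learned per-expert noise scale (a softplus of a second projection, `w_noise`) is an assumption that goes beyond the description above, which only says that Gaussian noise is added to the logits.

```python
import numpy as np


def softplus(z):
    # Numerically stable log(1 + exp(z)); keeps the noise scale positive.
    return np.logaddexp(0.0, z)


def noisy_top_k_gating(x, w_gate, w_noise, k, rng=None, training=True):
    """Return (gate weights, selected expert indices) for each token.

    x:       (n_tokens, d_model) token representations
    w_gate:  (d_model, n_experts) projection producing the raw logits
    w_noise: (d_model, n_experts) projection producing the noise scale
             (assumed here; a fixed scale would also fit the description)
    k:       number of experts kept per token, 1 <= k <= n_experts
    """
    logits = x @ w_gate
    if training:
        if rng is None:
            rng = np.random.default_rng()
        # Perturb the logits BEFORE selection, so the chosen experts can
        # change from step to step even for identical inputs.
        noise_std = softplus(x @ w_noise)
        logits = logits + rng.standard_normal(logits.shape) * noise_std

    # Indices of the k largest (noisy) logits for each token.
    top_idx = np.argpartition(logits, -k, axis=-1)[:, -k:]

    # Keep only the selected logits; everything else becomes -inf so that
    # it receives exactly zero weight after the softmax.
    masked = np.full_like(logits, -np.inf)
    kept = np.take_along_axis(logits, top_idx, axis=-1)
    np.put_along_axis(masked, top_idx, kept, axis=-1)

    # Softmax over the kept logits only.
    masked -= masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights, top_idx
```

Calling the function with `training=False` skips the noise and makes routing deterministic, which is the usual choice at inference time. With `training=True`, repeated calls on the same input can select different experts whenever the noise is large enough to reorder the logits near the top-k boundary.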