Noisy top-k gating

Appears in 1 paper

As used in Paper 09 — Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer →

The gating formulation from Shazeer et al. (2017): during training, Gaussian noise is added to each expert's raw gating logit before the top-k experts are selected. The noise encourages exploration: an input whose noise-free logits would always pick the same experts may be routed to different experts on different training steps. This routing diversity helps every expert receive training signal, instead of leaving a few favored experts to absorb most of the updates.
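
The mechanism can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the paper's implementation. It assumes the per-expert logits have already been computed, and it follows the paper's form H(x)_i = (x·W_g)_i + StandardNormal() · Softplus((x·W_noise)_i), followed by a softmax over only the k largest values. The function name `noisy_top_k_gate`, its parameters, and the `training` flag that switches the noise off are assumptions made for this example.

```python
import math
import random


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def noisy_top_k_gate(clean_logits, noise_logits, k, rng=None, training=True):
    """Return sparse gate weights over experts for a single input.

    clean_logits: per-expert raw logits, i.e. (x @ W_g)_i.
    noise_logits: per-expert pre-softplus noise scales, i.e. (x @ W_noise)_i.
    training:     hypothetical flag; noise is only added when it is True.
    """
    if rng is None:
        rng = random.Random()

    if training:
        # Add Gaussian noise with a per-expert (softplus) standard deviation.
        noisy = [c + rng.gauss(0.0, 1.0) * softplus(n)
                 for c, n in zip(clean_logits, noise_logits)]
    else:
        noisy = list(clean_logits)

    # Indices of the k largest (noisy) logits.
    top = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]

    # Softmax over the kept logits only; every other expert gets weight 0.
    m = max(noisy[i] for i in top)
    exps = {i: math.exp(noisy[i] - m) for i in top}
    total = sum(exps.values())
    gates = [0.0] * len(noisy)
    for i in top:
        gates[i] = exps[i] / total
    return gates


# Demo: experts 0 and 1 always win without noise, but with noise a
# close runner-up (expert 2) is also selected on some steps.
rng = random.Random(0)
clean = [2.0, 1.9, 1.8, 0.1]
noise = [0.0, 0.0, 0.0, 0.0]  # softplus(0) = log 2, roughly 0.69
selected = set()
for _ in range(200):
    gates = noisy_top_k_gate(clean, noise, k=2, rng=rng)
    selected.update(i for i, g in enumerate(gates) if g > 0)
print(sorted(selected))
```

With `training=False` the same input always activates experts 0 and 1, which is the fixed routing that the noise is meant to break up during training.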
