MoE (Mixture of Experts) layer

Appears in 1 paper

A drop-in replacement for the FFN sub-layer in a Transformer.

As used in Paper 09, Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer:

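The layer contains n expert networks and a gating network. For each token, the gating network scores all n experts and selects the top k; the layer's output for that token is the sum of the selected experts' outputs, weighted by their gate values. Only the k selected experts run for a given token, so compute per token stays roughly constant as n grows. The attention sub-layers stay dense and are shared across all tokens.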
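
The routing described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the names moe_layer, make_expert, and gate_w are hypothetical, each expert is assumed to be a small two-layer ReLU FFN, and the gate is a plain linear layer followed by a softmax over the k selected logits. The noise term in the gate and the auxiliary load-balancing losses described in the paper are omitted for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def make_expert(rng, d_model, d_hidden):
    # Hypothetical expert: a two-layer FFN with ReLU (an assumption for this sketch).
    w1 = rng.normal(scale=0.1, size=(d_model, d_hidden))
    w2 = rng.normal(scale=0.1, size=(d_hidden, d_model))
    return lambda x: np.maximum(x @ w1, 0.0) @ w2

def moe_layer(tokens, gate_w, experts, k=2):
    """tokens: (T, d_model); gate_w: (d_model, n); experts: list of n callables."""
    logits = tokens @ gate_w                                   # (T, n) gate scores
    top_idx = np.argsort(logits, axis=-1)[:, -k:]              # (T, k) indices of the k largest
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)  # (T, k)
    weights = softmax(top_logits, axis=-1)                     # gate values over selected experts
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        for j in range(k):
            expert = experts[top_idx[t, j]]
            # Only the k selected experts are evaluated for token t.
            out[t] += weights[t, j] * expert(tokens[t])
    return out, top_idx, weights

# Small demo: 4 tokens, model width 8, n = 4 experts, k = 2.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
gate_w = rng.normal(size=(8, 4))
experts = [make_expert(rng, 8, 16) for _ in range(4)]
out, top_idx, weights = moe_layer(tokens, gate_w, experts, k=2)
```

The explicit per-token loop is there for readability; a practical implementation would instead group tokens by expert and run each expert once on its batch.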