Mixture of Experts (MoE)

Appears in 1 paper

A layer type where multiple expert networks are available, and a router learns which expert(s) to use for each input.

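Mistral's follow-up, Mixtral 8×7B, uses this design: each MoE layer holds 8 experts, and the router activates only 2 of them per token. The "8×7B" name is approximate. The experts are feed-forward blocks inside each layer, while attention and the remaining parameters are shared, so the model has roughly 47B parameters in total and uses roughly 13B of them per token. The result is more model capacity at lower per-token compute than a dense network of the same total size.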
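The routing step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Mixtral's actual implementation: the names `top_k_route`, `moe_layer`, and `make_expert` are hypothetical, the experts are toy linear maps rather than feed-forward blocks, and the weights are random. It shows one common scheme, in which the router scores every expert, keeps the top 2, and mixes their outputs with softmax weights computed over the selected scores only.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(x, router_W, experts, k=2):
    """Run only the k routed experts and return their weighted sum."""
    logits = matvec(router_W, x)          # one score per expert
    out = [0.0] * len(x)
    for idx, weight in top_k_route(logits, k):
        y = experts[idx](x)               # the other experts are never evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


def make_expert(W):
    """Hypothetical toy expert: a single linear map (assumption for brevity)."""
    return lambda x: matvec(W, x)


random.seed(0)
dim, n_experts = 4, 8                     # 8 experts, as in the example above
router_W = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
experts = [
    make_expert([[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)])
    for _ in range(n_experts)
]

token = [random.gauss(0, 1) for _ in range(dim)]
routes = top_k_route(matvec(router_W, token), k=2)
output = moe_layer(token, router_W, experts, k=2)
print("selected experts and weights:", routes)
print("layer output:", output)
```

Because `moe_layer` calls only the two selected experts, the per-token cost scales with `k` rather than with the total number of experts, which is where the compute saving comes from.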