Mixture of Experts (MoE)
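A layer type that contains multiple expert networks plus a router that learns which expert(s) to use for each input. Mistral's follow-up, Mixtral 8×7B, uses this: each MoE layer has 8 experts, but only 2 are activated per token. Because only the feed-forward blocks are replicated while the attention layers are shared, the model has roughly 47B total parameters (not 8 × 7B = 56B) and uses about 13B of them per token. This gives larger model capacity at lower per-token compute than a dense network with the same total parameter count.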
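
The routing idea can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not Mixtral's actual implementation: the names `top_k_route` and `moe_layer` are hypothetical, the "experts" are simple scaling functions standing in for feed-forward networks, and the mixing weights are assumed to come from a softmax over the selected experts' router scores.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and return (index, weight) pairs.

    Weights are a softmax over the selected logits only (an assumption
    of this sketch), so they sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(x, router, experts, k=2):
    """Apply a toy MoE layer to one token vector x.

    router:  one weight vector per expert (a linear scoring layer).
    experts: one callable per expert, mapping a vector to a vector.
    Only the k selected experts are evaluated; the rest cost nothing.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    out = [0.0] * len(x)
    for idx, weight in top_k_route(logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Toy setup: 8 experts, 2 active per token (mirroring the 8-and-2 split above).
random.seed(0)
dim, num_experts = 4, 8
router = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
# Hypothetical experts: expert i multiplies its input by (i + 1).
experts = [lambda v, s=i + 1: [s * vi for vi in v] for i in range(num_experts)]

token = [0.5, -1.0, 0.25, 2.0]
output = moe_layer(token, router, experts, k=2)
```

The loop in `moe_layer` calls only 2 of the 8 expert functions for each token, which is where the compute saving comes from: all 8 experts contribute capacity, but each token pays for 2.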