Sparse computation
The opposite of dense: only a fraction of parameters are active for any given input. A mixture-of-experts (MoE) model achieves sparsity by routing each token to k of n experts rather than computing all n. With n=100 equally sized experts and k=2, each token uses 2% of the model's expert FFN parameters.
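
The routing step can be sketched for a single token as follows. This is a minimal illustration, not any particular model's router: the names `top_k_route` and `active_fraction` are hypothetical, the router logits are random stand-ins for a learned gating layer, and renormalising the softmax over only the selected experts is one common choice among several.

```python
import math
import random


def top_k_route(router_logits, k):
    """Return (expert_indices, gate_weights) for one token."""
    # Pick the k experts with the highest router scores.
    top = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:k]
    # Softmax over only the selected logits so the gate weights sum to 1
    # (an assumed convention; some routers normalise over all n first).
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


def active_fraction(n_experts, k):
    """Share of expert FFN parameters used per token, assuming equal-sized experts."""
    return k / n_experts


# Random logits stand in for the output of a learned router (assumption).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(100)]

experts, gates = top_k_route(logits, k=2)
print("selected experts:", experts)
print("gate weights:", gates)
print(f"{active_fraction(100, 2):.0%} of expert parameters active per token")
```

Only the two selected experts would run their FFN for this token; the other 98 are skipped entirely, which is where the compute saving comes from.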