Sparse computation

Appears in 1 paper

The opposite of dense computation: only a fraction of a model's parameters are active for any given input.

As used in Paper 09, Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

A mixture-of-experts (MoE) layer achieves sparsity by routing each token to k of n experts rather than computing all n. With 100 equally sized experts and k=2, each token runs through 2 of the 100 experts, so it uses 2% of the layer's expert FFN parameters (not counting the small gating network).
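
The routing idea can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (top_k_route, sparse_moe, active_fraction) are hypothetical, the experts are toy scalar functions rather than feed-forward networks, and the gate logits are given by hand rather than learned. It also leaves out anything beyond plain top-k selection, such as noise on the gate or load balancing.

```python
import math


def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    # Softmax over the chosen logits only; subtract the max for numerical stability.
    peak = max(gate_logits[i] for i in chosen)
    exps = [math.exp(gate_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def sparse_moe(x, experts, gate_logits, k):
    """Weighted sum of the k routed experts; the other n - k are never called."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))


def active_fraction(n_experts, k):
    """Fraction of expert parameters used per token, assuming equal-sized experts."""
    return k / n_experts


# Toy setup (assumption): 100 scalar "experts", where expert i multiplies its input by i.
calls = []


def make_expert(i):
    def expert(x):
        calls.append(i)  # record which experts actually run
        return i * x
    return expert


experts = [make_expert(i) for i in range(100)]
gate_logits = [0.0] * 100
gate_logits[7] = 2.0
gate_logits[42] = 2.0  # experts 7 and 42 tie for the top score

y = sparse_moe(2.0, experts, gate_logits, k=2)
print(y, sorted(calls), active_fraction(100, 2))  # prints: 49.0 [7, 42] 0.02
```

In this toy run only experts 7 and 42 are ever called; the remaining 98 contribute no computation, which is the sense in which the layer is sparse.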