Capacity factor

Appears in 1 paper

A multiplier that sets the maximum number of tokens each expert can process per batch: `capacity = (batch_tokens / n_experts) × capacity_factor`.

As used in Paper 09 — Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer →

A capacity factor of 1.0 gives each expert exactly its fair share of the batch, `batch_tokens / n_experts` tokens; a factor of 2.0 doubles that limit, leaving headroom for uneven routing. Tokens routed to an expert that is already at capacity are dropped. A higher capacity factor reduces dropping but increases memory cost, because each expert's buffer grows with it.
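The arithmetic can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names `expert_capacity` and `route_with_capacity` are hypothetical, rounding the capacity up to a whole number of tokens is an assumption (the formula itself does not say how to round), and tokens are assumed to be admitted in batch order with one expert per token.

```python
import math
from collections import Counter


def expert_capacity(batch_tokens: int, n_experts: int, capacity_factor: float) -> int:
    """capacity = (batch_tokens / n_experts) * capacity_factor.

    Rounding up is an assumption made for this sketch.
    """
    return math.ceil(batch_tokens / n_experts * capacity_factor)


def route_with_capacity(assignments, n_experts, capacity_factor):
    """Admit tokens in order; drop any token whose expert is already full.

    `assignments[i]` is the (hypothetical) expert index chosen for token i.
    Returns (capacity, kept_token_indices, dropped_token_indices).
    """
    capacity = expert_capacity(len(assignments), n_experts, capacity_factor)
    load = Counter()  # tokens accepted so far, per expert
    kept, dropped = [], []
    for token_idx, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token_idx)
        else:
            dropped.append(token_idx)
    return capacity, kept, dropped


# 8 tokens, 4 experts, with routing skewed toward expert 0.
assignments = [0, 0, 0, 0, 1, 2, 3, 0]
for factor in (1.0, 2.0, 2.5):
    capacity, kept, dropped = route_with_capacity(assignments, 4, factor)
    print(f"factor={factor}: capacity={capacity}, dropped tokens={dropped}")
```

In this toy batch, expert 0 is chosen five times. At a factor of 1.0 the capacity is 2, so three of its tokens are dropped; at 2.0 the capacity is 4 and one is dropped; at 2.5 the capacity is 5 and nothing is dropped, at the price of a buffer 2.5 times larger for every expert.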

Appears in papers