MoE · Mixture of Experts
A sparse model architecture in which a router activates only a small subset of expert sub-networks for each token, so per-token compute stays well below what the total parameter count would imply.
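As a rough illustration of the routing idea, the sketch below sends one token to its top-k experts and mixes their outputs. It is a minimal toy under stated assumptions, not any particular model's implementation: the names `moe_forward`, `router_logits`, and `k` are hypothetical, the experts are plain scalar functions, and the gate weights are assumed to be a softmax over the selected experts only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(token, router_logits, experts, k=2):
    """Route one token to its top-k experts and mix their outputs.

    All names here are illustrative assumptions, not a library API.
    """
    # Pick the k experts with the highest router scores.
    top = sorted(range(len(experts)),
                 key=lambda i: router_logits[i],
                 reverse=True)[:k]
    # Renormalise gate weights over the selected experts only (an assumption).
    gates = softmax([router_logits[i] for i in top])
    # Only the selected experts are evaluated; the others are never called.
    output = sum(g * experts[i](token) for g, i in zip(gates, top))
    return output, top


# Four toy "experts"; in a real model each would be a feed-forward sub-network.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
output, used = moe_forward(3.0, router_logits=[0.1, 2.0, 2.0, -1.0],
                           experts=experts, k=2)
```

Here two of the four experts run for the token, so the work done per token scales with k rather than with the total number of experts.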