Mixture of Experts (MoE)
An architecture in which a learned routing (gating) function activates only a small subset of parameters (experts) per token, typically the top-k scoring experts. This allows total parameter count to scale while compute per token stays roughly constant. Used in Kimi Linear and many modern LLMs.
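A minimal sketch of top-k expert routing, assuming PyTorch; the class and parameter names (SimpleMoE, num_experts, top_k) are illustrative and not taken from any particular model:

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class SimpleMoE(nn.Module):
      def __init__(self, d_model=64, d_hidden=128, num_experts=8, top_k=2):
          super().__init__()
          self.top_k = top_k
          self.router = nn.Linear(d_model, num_experts)  # routing function
          self.experts = nn.ModuleList(
              nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                            nn.Linear(d_hidden, d_model))
              for _ in range(num_experts)
          )

      def forward(self, x):                     # x: (tokens, d_model)
          logits = self.router(x)               # (tokens, num_experts)
          weights, idx = logits.topk(self.top_k, dim=-1)
          weights = F.softmax(weights, dim=-1)  # normalize over chosen experts
          out = torch.zeros_like(x)
          # Only the selected experts run for each token, so compute per
          # token stays roughly constant as num_experts grows.
          for slot in range(self.top_k):
              for e in range(len(self.experts)):
                  mask = idx[:, slot] == e
                  if mask.any():
                      out[mask] += weights[mask, slot:slot+1] * self.experts[e](x[mask])
          return out

  tokens = torch.randn(4, 64)
  print(SimpleMoE()(tokens).shape)  # torch.Size([4, 64])

The loop over experts is written for clarity; production implementations batch tokens by expert and add a load-balancing loss so the router does not collapse onto a few experts.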
See also: expert_routing, kimi_linear, attention_residuals
