Mixture of Experts (MoE)
An architecture in which a learned routing (gating) function activates only a small subset of parameters (experts) per token, typically the top-k scoring experts. This allows total parameter count to scale while compute per token stays roughly constant. Used in Kimi Linear and many modern LLMs.
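A minimal sketch of top-k expert routing, assuming PyTorch; the class and parameter names (SimpleMoE, num_experts, top_k) are illustrative and not taken from any particular model:

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class SimpleMoE(nn.Module):
      def __init__(self, d_model=64, d_hidden=128, num_experts=8, top_k=2):
          super().__init__()
          self.top_k = top_k
          self.router = nn.Linear(d_model, num_experts)  # routing function
          self.experts = nn.ModuleList(
              nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                            nn.Linear(d_hidden, d_model))
              for _ in range(num_experts)
          )

      def forward(self, x):                     # x: (tokens, d_model)
          logits = self.router(x)               # (tokens, num_experts)
          weights, idx = logits.topk(self.top_k, dim=-1)
          weights = F.softmax(weights, dim=-1)  # normalize over chosen experts
          out = torch.zeros_like(x)
          # Only the selected experts run for each token, so compute per
          # token stays roughly constant as num_experts grows.
          for slot in range(self.top_k):
              for e in range(len(self.experts)):
                  mask = idx[:, slot] == e
                  if mask.any():
                      out[mask] += weights[mask, slot:slot+1] * self.experts[e](x[mask])
          return out

  tokens = torch.randn(4, 64)
  print(SimpleMoE()(tokens).shape)  # torch.Size([4, 64])

The loop over experts is written for clarity; production implementations batch tokens by expert and add a load-balancing loss so the router does not collapse onto a few experts.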
See also: expert_routing, kimi_linear, attention_residuals
