====== Mixture of Experts (MoE) ======

An architecture in which only a subset of parameters (the experts) is activated for each input, as determined by a learned routing function. This lets the total parameter count scale while the compute per token stays roughly constant, since only the selected experts run. Used in [[concepts:kimi_linear|Kimi Linear]] and many modern [[concepts:llm|LLMs]].

See also: [[concepts:expert_routing]], [[concepts:kimi_linear]], [[papers:attention_residuals]]
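The routing idea can be sketched in a few lines: a router scores each expert for a token, only the top-k experts are executed, and their outputs are combined with softmax weights. This is a minimal illustration with randomly initialized linear experts; all names (''W_router'', ''moe_forward'') and the choice of top-2 routing over 4 experts are assumptions for demonstration, not the mechanism of any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 2

# Router: linear map from a token vector to one logit per expert (toy weights)
W_router = rng.normal(size=(d, n_experts))
# Each expert: a simple linear map d -> d (stand-in for an FFN block)
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

def moe_forward(x):
    """Route token x to its top-k experts; only those experts are computed."""
    logits = x @ W_router
    top = np.argsort(logits)[-top_k:]   # indices of the top-k scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()            # softmax over the selected experts only
    # Only top_k of the n_experts expert matmuls execute for this token,
    # so compute per token depends on top_k, not on the total expert count.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

y = moe_forward(rng.normal(size=d))
print(y.shape)  # (8,)
```

Scaling total capacity then means adding experts (growing ''n_experts'') while holding ''top_k'' fixed, which is the trade-off the entry describes.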