====== Linear Attention ======

Attention variants that replace the O(n²) softmax attention with a kernelized decomposition, reducing the cost in sequence length from quadratic to linear. Used in [[concepts:kimi_linear|Kimi Linear]] to handle long contexts efficiently. Trades some expressiveness for computational savings.

See also: [[concepts:kimi_linear]], [[concepts:softmax_attention]], [[concepts:moe]], [[papers:attention_residuals]]
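A minimal sketch of the kernelized decomposition, assuming an ''elu(x) + 1'' feature map (one common choice; the actual kernel used in Kimi Linear may differ). The key idea is associativity: instead of materializing the n×n matrix ''softmax(QKᵀ)'', compute ''φ(Q)(φ(K)ᵀV)'', which costs O(n·d²) rather than O(n²·d).

<code python>
# Illustrative sketch of linear attention, not a reference implementation.
# Feature map elu(x) + 1 is an assumption here, chosen so weights stay positive.
import numpy as np

def feature_map(x):
    """elu(x) + 1: a positive kernel feature map."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    """O(n * d^2) attention via the kernel trick.

    Softmax attention:  softmax(Q K^T) V      -> builds an n x n matrix.
    Linear attention:   phi(Q) (phi(K)^T V)   -> associativity avoids n x n.

    Q, K: (n, d) queries and keys; V: (n, d_v) values.
    """
    Qf = feature_map(Q)                 # (n, d)
    Kf = feature_map(K)                 # (n, d)
    KV = Kf.T @ V                       # (d, d_v): key/value summary, computed once
    Z = Qf @ Kf.sum(axis=0)             # (n,): per-query normalizer
    return (Qf @ KV) / (Z[:, None] + eps)

# Example: cost grows linearly with sequence length n.
n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)  # (1024, 64)
</code>

Because ''φ(K)ᵀV'' is a fixed-size (d × d_v) summary, it can also be updated incrementally during autoregressive decoding, which is what makes linear attention attractive for long contexts.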