Linear Attention

Attention variants that replace the O(n²) softmax attention matrix with a kernelized decomposition, reducing the cost in sequence length n from O(n²) to O(n). Used in Kimi Linear to handle long contexts efficiently. Trades some expressiveness for computational savings.
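The core trick can be sketched as follows: if the softmax similarity is replaced by a kernel sim(q, k) = φ(q)·φ(k) for some feature map φ, the matrix product can be reassociated so that Kᵀ V (a d×d object, independent of sequence length) is computed first. This is a minimal NumPy sketch with an illustrative ReLU-based feature map; the actual feature map and normalization vary between linear-attention variants, and this is not Kimi Linear's specific formulation.

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Kernelized attention: phi(Q) (phi(K)^T V), linear in sequence length.

    phi is an assumed illustrative feature map (ReLU plus a small constant
    to keep the normalizer positive); real variants use other maps.
    Q, K: (n, d); V: (n, d_v).
    """
    Qp, Kp = phi(Q), phi(K)                       # feature-mapped queries/keys, (n, d)
    # Associativity: (Qp @ Kp.T) @ V == Qp @ (Kp.T @ V).
    # Computing Kp.T @ V first costs O(n * d * d_v) -- linear in n --
    # instead of materializing the O(n^2) attention matrix.
    KV = Kp.T @ V                                 # (d, d_v)
    Z = Qp @ Kp.sum(axis=0, keepdims=True).T      # row normalizer, (n, 1)
    return (Qp @ KV) / Z                          # (n, d_v)
```

For small inputs this matches the explicit quadratic form `(phi(Q) @ phi(K).T)`, row-normalized and multiplied by `V`, which makes the equivalence easy to verify numerically.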

See also: kimi_linear, softmax_attention, moe, attention_residuals