====== Attention Residuals ======

A technique by the Kimi Team that replaces the fixed unit-weight residual accumulation of standard transformers with a learned softmax attention over the outputs of all preceding layers, so each layer receives a weighted mix of the earlier hidden states rather than their plain sum. A minimal sketch follows the links below.

See also: [[concepts:residual_connections]], [[concepts:prenorm]], [[concepts:block_attnres]], [[concepts:kimi_linear]], [[concepts:moe]], [[concepts:softmax_attention]], [[concepts:scaling_laws]], [[concepts:layer_pruning]], [[concepts:gradient_highway]], [[concepts:hidden_state_growth]], [[concepts:rnn]]
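The sketch below shows one plausible reading of the idea, not the published formulation: per-token softmax scores over the layer history are computed from hypothetical query/key projections (''q'', ''k''), and the resulting weighted combination stands in for the usual running residual sum. All module and parameter names here are assumptions; the actual score function, normalization, and placement within the block may differ.

<code python>
import torch
import torch.nn as nn

class AttentionResidual(nn.Module):
    """Hypothetical sketch: replace the unit-weight residual sum with
    softmax attention over the outputs of all preceding layers.
    The real Kimi formulation of the scores may differ."""

    def __init__(self, d_model: int, d_score: int = 32):
        super().__init__()
        self.q = nn.Linear(d_model, d_score)  # query from the newest output (assumption)
        self.k = nn.Linear(d_model, d_score)  # keys from every stored output (assumption)

    def forward(self, history: list[torch.Tensor]) -> torch.Tensor:
        # history[i]: output of layer i (history[0] = embeddings), each (B, T, D).
        stacked = torch.stack(history)                  # (L, B, T, D)
        q = self.q(history[-1])                         # (B, T, S)
        k = self.k(stacked)                             # (L, B, T, S)
        # Scaled dot-product scores; q broadcasts over the layer axis L.
        scores = (k * q).sum(-1) / q.shape[-1] ** 0.5   # (L, B, T)
        # Softmax over layers (not tokens): learned, input-dependent weights
        # replace the fixed unit weights of a standard residual stream.
        w = torch.softmax(scores, dim=0)                # (L, B, T)
        return (w.unsqueeze(-1) * stacked).sum(0)       # (B, T, D)

# Usage in a toy stack: each sub-block reads its residual input from
# attention over the full layer history instead of a running sum.
depth, d_model = 4, 64
mixes = nn.ModuleList(AttentionResidual(d_model) for _ in range(depth))
blocks = nn.ModuleList(
    nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_model), nn.GELU())
    for _ in range(depth)
)
history = [torch.randn(2, 16, d_model)]     # (batch, seq, d_model) embeddings
for mix, block in zip(mixes, blocks):
    residual = mix(history)                 # softmax-weighted mix of history
    history.append(residual + block(residual))
print(history[-1].shape)                    # torch.Size([2, 16, 64])
</code>

One design point worth noting: the softmax yields a normalized combination (weights summing to 1), whereas a plain residual stream adds every contribution with weight exactly 1; uniform softmax weights therefore recover the mean of the history rather than its sum.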