concepts:softmax_attention
Softmax Attention
The standard attention mechanism: computes the dot-product similarity between a query and each key, applies a softmax over the resulting scores to produce a probability distribution, then takes the corresponding weighted sum of the values. In Attention Residuals, this same mechanism is repurposed for depth-wise aggregation: the softmax is taken over layer outputs rather than sequence positions.
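A minimal NumPy sketch of the mechanism described above, for a single query over n keys and values. All names and shapes here are illustrative, not taken from any particular implementation; note this is plain (unscaled) dot-product attention, without the 1/sqrt(d) factor used in scaled_dot_product_attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_attention(q, K, V):
    # q: (d,) query; K: (n, d) keys; V: (n, d_v) values.
    scores = K @ q                # dot-product similarity, shape (n,)
    weights = softmax(scores)     # probability distribution over the n keys
    return weights @ V            # weighted sum of values, shape (d_v,)

rng = np.random.default_rng(0)
q = rng.standard_normal(4)
K = rng.standard_normal((3, 4))  # in the depth-wise reading, n = number of layers
V = rng.standard_normal((3, 4))
out = softmax_attention(q, K, V)
```

In the depth-wise reading used by Attention Residuals, the n rows of K and V would come from the outputs of n layers rather than from n sequence positions; the mechanism itself is unchanged.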
See also: scaled_dot_product_attention, multi_head_attention, attention_residuals
