====== Softmax Attention ======

The standard attention mechanism: it computes dot-product similarity between a query and keys, applies softmax to produce a probability distribution over the keys, then takes a weighted sum of the values. In [[papers:attention_residuals|Attention Residuals]], softmax attention is repurposed for depth-wise aggregation across layer outputs instead of sequence positions.

See also: [[concepts:scaled_dot_product_attention]], [[concepts:multi_head_attention]], [[papers:attention_residuals]]
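A minimal NumPy sketch of the mechanism described above (the function names and the 1/sqrt(d) scaling convention are illustrative, following the standard scaled dot-product formulation rather than anything specific to the linked paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_attention(Q, K, V):
    # Q: (n_q, d), K: (n_k, d), V: (n_k, d_v)
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # dot-product similarity, scaled
    weights = softmax(scores, axis=-1)   # probability distribution over keys
    return weights @ V                   # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))
K = rng.standard_normal((5, 4))
V = rng.standard_normal((5, 3))
out = softmax_attention(Q, K, V)
print(out.shape)  # (2, 3): one aggregated value vector per query
```

For depth-wise aggregation as in the linked paper, the "keys" and "values" would be layer outputs rather than token positions, but the computation itself is unchanged.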