concepts:softmax_attention
Softmax Attention
The standard attention mechanism: computes the dot-product similarity between a query and each key, applies a softmax over the resulting scores to produce a probability distribution, then takes the corresponding weighted sum of the values. In Attention Residuals, this same mechanism is repurposed for depth-wise aggregation: the softmax is taken over layer outputs rather than sequence positions.
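A minimal NumPy sketch of the mechanism described above, for a single query over n keys and values. All names and shapes here are illustrative, not taken from any particular implementation; note this is plain (unscaled) dot-product attention, without the 1/sqrt(d) factor used in scaled_dot_product_attention.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_attention(q, K, V):
    # q: (d,) query; K: (n, d) keys; V: (n, d_v) values.
    scores = K @ q                # dot-product similarity, shape (n,)
    weights = softmax(scores)     # probability distribution over the n keys
    return weights @ V            # weighted sum of values, shape (d_v,)

rng = np.random.default_rng(0)
q = rng.standard_normal(4)
K = rng.standard_normal((3, 4))  # in the depth-wise reading, n = number of layers
V = rng.standard_normal((3, 4))
out = softmax_attention(q, K, V)
```

In the depth-wise reading used by Attention Residuals, the n rows of K and V would come from the outputs of n layers rather than from n sequence positions; the mechanism itself is unchanged.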
See also: scaled_dot_product_attention, multi_head_attention, attention_residuals
