
Scaled Dot-Product Attention

The core computation behind softmax_attention: softmax(QK^T / sqrt(d_k)) V, where Q, K, and V are the query, key, and value matrices and d_k is the key dimension. The sqrt(d_k) scaling keeps dot-product magnitudes roughly constant as d_k grows: for unit-variance components, an unscaled dot product has variance d_k, so large d_k would push the softmax into saturated regions with vanishing gradients.
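The formula above can be sketched directly in NumPy. This is a minimal illustrative implementation (function name and shapes are our own choices, not part of any standard API); the max-subtraction step is the usual numerical-stability trick for softmax:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k) queries, K: (n_k, d_k) keys, V: (n_k, d_v) values.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) scaled logits
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V                            # (n_q, d_v)
```

Each output row is a convex combination of the rows of V, with mixing weights given by the softmax over scaled query-key dot products.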

See also: softmax_attention, multi_head_attention, attention_residuals

concepts/scaled_dot_product_attention.txt · Last modified: by aethersync
