====== Scaled Dot-Product Attention ======

The core computation behind [[concepts:softmax_attention]]:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

The sqrt(d_k) scaling keeps the dot products from growing with dimension: if the components of a query and key are independent with zero mean and unit variance, their dot product has variance d_k, so unscaled logits in high dimensions would push the softmax into saturated regions with vanishing gradients.

See also: [[concepts:softmax_attention]], [[concepts:multi_head_attention]], [[papers:attention_residuals]]
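A minimal NumPy sketch of the formula above (the function name and shapes are illustrative, not from any particular library):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(QK^T / sqrt(d_k)) V.

    Q: (n_q, d_k) queries, K: (n_k, d_k) keys, V: (n_k, d_v) values.
    Returns an (n_q, d_v) array of attention outputs.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) scaled logits
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax: each row sums to 1
    return weights @ V                              # convex combination of value rows
```

Because each row of the attention weights sums to 1, every output row is a convex combination of the rows of V.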