====== Multi-Head Attention ======

Running multiple [[concepts:scaled_dot_product_attention|attention]] operations in parallel with separate learned projections, then concatenating the results. Each head can attend to different positional or semantic relationships. Standard in all modern [[concepts:llm|LLMs]].

See also: [[concepts:scaled_dot_product_attention]], [[concepts:softmax_attention]], [[papers:attention_residuals]]
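A minimal NumPy sketch of the idea: the input is projected per head, each head runs scaled dot-product attention independently, and the head outputs are concatenated and passed through an output projection. Shapes and weight names (`w_q`, `w_k`, `w_v`, `w_o`) are illustrative assumptions, not any particular library's API.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over x of shape (seq_len, d_model); sketch only."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project, then split into heads: (num_heads, seq_len, d_head)
    def project(w):
        return (x @ w).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = project(w_q), project(w_k), project(w_v)

    # Scaled dot-product attention, computed independently per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    heads_out = softmax(scores) @ v  # (num_heads, seq_len, d_head)

    # Concatenate heads back to (seq_len, d_model), apply output projection
    concat = heads_out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

# Toy usage with random weights (hypothetical sizes)
rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 8, 2, 4
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) * 0.1
                      for _ in range(4))
y = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(y.shape)
```

The separate per-head projections are what let heads specialize: each head sees its own `d_head`-dimensional view of the input, so different heads can learn different relationships before the output projection mixes them.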