====== Multi-Head Attention ======

Running multiple [[concepts:scaled_dot_product_attention|attention]] operations in parallel with separate learned projections, then concatenating the results. Each head can attend to different positional or semantic relationships. Standard in all modern [[concepts:llm|LLMs]].

See also: [[concepts:scaled_dot_product_attention]], [[concepts:softmax_attention]], [[papers:attention_residuals]]
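A minimal NumPy sketch of the idea: the input is projected per head, each head runs scaled dot-product attention independently, and the head outputs are concatenated and passed through an output projection. Shapes and weight names (`w_q`, `w_k`, `w_v`, `w_o`) are illustrative assumptions, not any particular library's API.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over x of shape (seq_len, d_model); sketch only."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project, then split into heads: (num_heads, seq_len, d_head)
    def project(w):
        return (x @ w).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = project(w_q), project(w_k), project(w_v)

    # Scaled dot-product attention, computed independently per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    heads_out = softmax(scores) @ v  # (num_heads, seq_len, d_head)

    # Concatenate heads back to (seq_len, d_model), apply output projection
    concat = heads_out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

# Toy usage with random weights (hypothetical sizes)
rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 8, 2, 4
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) * 0.1
                      for _ in range(4))
y = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(y.shape)
```

The separate per-head projections are what let heads specialize: each head sees its own `d_head`-dimensional view of the input, so different heads can learn different relationships before the output projection mixes them.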