Multi-Head Attention

Runs several attention operations in parallel, each with its own learned query, key, and value projections; the per-head outputs are concatenated and passed through a final linear output projection. Each head can attend to different positional or semantic relationships. Standard in virtually all modern transformer-based LLMs.
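A minimal NumPy sketch of the idea, assuming single-sequence (unbatched) inputs and square `d_model × d_model` projection matrices; all names here are illustrative, not from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    """x: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model) learned projections."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Project the input, then split the feature dimension into heads:
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    q = (x @ Wq).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ Wk).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ Wv).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product attention per head: scores are (num_heads, seq_len, seq_len)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    out = weights @ v                                  # (num_heads, seq_len, d_head)
    # Concatenate heads back into d_model, then apply the output projection.
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo

rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 8, 5, 2
x = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
y = multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads)
print(y.shape)  # (5, 8)
```

The reshape/transpose steps are the "separate projections per head" in practice: one big matrix multiply is split into `num_heads` slices of width `d_head = d_model // num_heads`, so total compute stays comparable to a single full-width head.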

See also: scaled_dot_product_attention, softmax_attention, attention_residuals