Running multiple scaled dot-product attention operations in parallel, each with its own learned query, key, and value projections; the per-head outputs are concatenated and passed through a final output projection. Each head can attend to different positional or semantic relationships. Standard in all modern LLMs.
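A minimal NumPy sketch of the mechanism, assuming a single input sequence and square projection matrices; all function and variable names here are illustrative, not from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """x: (seq_len, d_model); each W_*: (d_model, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project the input, then split the feature dimension into heads:
    # (seq_len, d_model) -> (num_heads, seq_len, d_head).
    def project(W):
        return (x @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = project(W_q), project(W_k), project(W_v)

    # Each head runs scaled dot-product attention independently.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    per_head = weights @ v                               # (heads, seq, d_head)

    # Concatenate head outputs and apply the final output projection.
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 8, 4, 2
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.1
                      for _ in range(4))
out = multi_head_attention(rng.standard_normal((seq_len, d_model)),
                           W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # (4, 8): one d_model-sized output vector per position
```

Production implementations also handle batching, attention masks, and dropout; the reshape/transpose steps are the only thing that distinguishes this from a single attention head over the full `d_model`.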
See also: scaled_dot_product_attention, softmax_attention, attention_residuals