concepts:multi_head_attention
Multi-Head Attention
Runs several scaled dot-product attention operations ("heads") in parallel, each with its own learned query, key, and value projections, then concatenates the heads' outputs and applies a final linear projection. Each head can attend to different positional or semantic relationships. Standard in virtually all modern LLMs.
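A minimal NumPy sketch of the mechanism described above, for a single unbatched sequence; the function name, weight shapes, and combined projection matrices are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    # x: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model).
    # Illustrative sketch: one combined projection per role, split into heads.
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    # Project once, then split the feature dimension into heads:
    # (seq_len, d_model) -> (n_heads, seq_len, d_head).
    def project(W):
        return (x @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = project(Wq), project(Wk), project(Wv)

    # Scaled dot-product attention, computed for all heads in parallel.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    out = softmax(scores) @ v                            # (n_heads, seq, d_head)

    # Concatenate heads back into one vector per position, then project.
    concat = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, seq_len, n_heads = 16, 5, 4
x = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                  for _ in range(4))
y = multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads)
print(y.shape)  # (5, 16)
```

Each head sees only a d_model / n_heads slice of the projected features, so the total compute matches single-head attention at the same d_model.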
See also: scaled_dot_product_attention, softmax_attention, attention_residuals
concepts/multi_head_attention.txt · Last modified: by aethersync
