====== Transformer ======

The dominant architecture for [[concepts:llm|LLMs]], built from alternating self-attention and feed-forward layers with [[concepts:residual_connections|residual connections]] and [[concepts:layer_normalization|layer normalization]]. The original design used [[concepts:postnorm|PostNorm]]; modern variants use [[concepts:prenorm|PreNorm]]. [[papers:attention_residuals|Attention Residuals]] modifies how the residual stream accumulates across layers.

See also: [[concepts:softmax_attention]], [[concepts:multi_head_attention]], [[concepts:prenorm]], [[concepts:residual_connections]]
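A minimal numpy sketch of a single PreNorm transformer block, to make the sublayer ordering concrete: normalize, apply the sublayer (single-head attention or feed-forward), then add the result back onto the residual stream. All names, shapes, and the single-head simplification here are illustrative assumptions, not any particular implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean, unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Single-head scaled dot-product attention (no masking), for brevity.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def ffn(x, W1, W2):
    # Position-wise feed-forward with a ReLU nonlinearity.
    return np.maximum(0.0, x @ W1) @ W2

def prenorm_block(x, params):
    # PreNorm: LayerNorm is applied *before* each sublayer,
    # and the sublayer output is added to the residual stream.
    x = x + self_attention(layer_norm(x), *params["attn"])
    x = x + ffn(layer_norm(x), *params["ffn"])
    return x
```

A PostNorm block would instead compute `layer_norm(x + sublayer(x))`, normalizing the residual stream itself after each addition.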