The dominant architecture for LLMs: a stack of alternating self-attention and feed-forward sublayers, each wrapped with a residual connection and layer normalization. The original design applied layer normalization after each sublayer (PostNorm); modern variants normalize before it (PreNorm). Attention Residuals modifies how the residual stream accumulates across layers.
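A minimal sketch of a single PreNorm block in NumPy may clarify the layer structure. The single-head attention, ReLU feed-forward network, and all weight names here are illustrative simplifications, not any particular model's implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's features to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Single-head scaled dot-product attention (illustrative).
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def ffn(x, W1, W2):
    # Two-layer ReLU feed-forward network.
    return np.maximum(0, x @ W1) @ W2

def prenorm_block(x, params):
    # PreNorm: normalize *before* each sublayer; the residual stream
    # itself is never normalized, only each sublayer's input.
    x = x + self_attention(layer_norm(x), *params["attn"])
    x = x + ffn(layer_norm(x), *params["ffn"])
    return x

# Demo with random weights: a block maps (seq_len, d) -> (seq_len, d).
rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))
params = {
    "attn": tuple(rng.normal(size=(d, d)) for _ in range(3)),
    "ffn": (rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))),
}
y = prenorm_block(x, params)
```

A PostNorm block would instead compute `x = layer_norm(x + sublayer(x))`, normalizing the residual stream itself after each addition.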
See also: softmax_attention, multi_head_attention, prenorm, residual_connections