====== PreNorm ======
Applying layer normalization **before** the sublayer transformation (attention or FFN), so each block updates the residual stream as x + Sublayer(LN(x)). Dominant in modern LLMs because it stabilizes training, but the unweighted residual accumulation causes hidden-state magnitudes to grow as O(L) with depth, diluting each individual layer's contribution.

See also: [[concepts:residual_connections]], [[concepts:layer_normalization]], [[concepts:postnorm]], [[concepts:hidden_state_growth]], [[papers:attention_residuals]]
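A minimal NumPy sketch of the PreNorm update x + Sublayer(LN(x)), using a hypothetical fixed random linear map in place of a real attention or FFN sublayer, to show the residual-stream norm growing with depth:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize the last dimension to zero mean, unit variance (no learned scale/shift).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def prenorm_block(x, sublayer):
    # PreNorm: normalize *before* the sublayer, then add the unweighted residual.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
d = 64
# Stand-in sublayer: a fixed random linear map (hypothetical, for illustration only).
W = rng.standard_normal((d, d)) / np.sqrt(d)
sublayer = lambda h: h @ W

x = rng.standard_normal(d)
norms = []
for _ in range(32):
    x = prenorm_block(x, sublayer)
    norms.append(float(np.linalg.norm(x)))

# Because LN keeps each block's input at a fixed scale, every block adds a
# roughly unit-scale contribution, so the residual-stream norm keeps growing
# with depth -- the accumulation effect described above.
print(f"norm after block 1: {norms[0]:.1f}, after block 32: {norms[-1]:.1f}")
```

The sketch only demonstrates the growth trend; the exact growth rate depends on how correlated successive sublayer outputs are, which a random linear map does not capture.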