====== Layer Normalization ======

Normalizing activations across the feature dimension to stabilize training. Applied either before ([[concepts:prenorm|PreNorm]]) or after ([[concepts:postnorm|PostNorm]]) the sublayer. PreNorm dominates modern [[concepts:llm|LLMs]] but causes [[concepts:hidden_state_growth|hidden-state growth]].

See also: [[concepts:prenorm]], [[concepts:postnorm]], [[concepts:hidden_state_growth]], [[papers:attention_residuals]]
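A minimal NumPy sketch of the idea, not any particular framework's implementation: each token's feature vector is normalized to zero mean and unit variance, then rescaled by learned parameters (here called ''gamma'' and ''beta''), and the PreNorm/PostNorm distinction is just where the normalization sits relative to the sublayer and residual add.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each row (token) over its feature dimension,
    # then apply the learned scale (gamma) and shift (beta).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

def pre_norm_block(x, sublayer, gamma, beta):
    # PreNorm: normalize the input, then add the residual.
    # The residual stream itself is never normalized, so its
    # magnitude can grow across layers (hidden-state growth).
    return x + sublayer(layer_norm(x, gamma, beta))

def post_norm_block(x, sublayer, gamma, beta):
    # PostNorm: add the residual first, then normalize the sum.
    return layer_norm(x + sublayer(x), gamma, beta)

d = 8
x = np.random.randn(2, d)  # (tokens, features)
y = layer_norm(x, np.ones(d), np.zeros(d))
# each row of y now has mean ~0 and variance ~1
```

With identity ''gamma''/''beta'' the output rows are standardized; in training these parameters let the network undo the normalization where useful.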