====== PostNorm ======

Applying [[concepts:layer_normalization|layer normalization]] **after** the sublayer and residual addition. Used in the original Transformer but largely replaced by [[concepts:prenorm|PreNorm]] in modern [[concepts:llm|LLMs]] because PostNorm becomes unstable to train at depth. Since every block's output is normalized, PostNorm does not exhibit [[concepts:hidden_state_growth|hidden-state growth]]; the trade-off is harder optimization.

See also: [[concepts:prenorm]], [[concepts:layer_normalization]], [[concepts:hidden_state_growth]], [[papers:attention_residuals]]
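The placement difference can be sketched in a few lines of NumPy. This is an illustrative sketch, not any particular library's implementation; the function names and the toy ``sublayer`` are assumptions for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (last axis),
    # omitting the learned scale/shift for brevity.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def postnorm_block(x, sublayer):
    # PostNorm (original Transformer): normalize AFTER
    # the sublayer and the residual addition.
    return layer_norm(x + sublayer(x))

def prenorm_block(x, sublayer):
    # PreNorm, for contrast: normalize only the sublayer input;
    # the residual stream itself is never renormalized.
    return x + sublayer(layer_norm(x))

# Toy sublayer (hypothetical, just scales its input).
sublayer = lambda h: 2.0 * h

x = np.random.default_rng(0).normal(size=(4, 8))
post = postnorm_block(x, sublayer)
pre = prenorm_block(x, sublayer)
```

Because PostNorm renormalizes the residual stream at every block, ``post`` always has (approximately) zero mean and unit variance per position, which is why the hidden state cannot grow with depth; ``pre`` has no such constraint.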