Under PreNorm residual connections, hidden-state magnitudes grow as O(L) with depth L (the number of layers), because each layer adds a roughly unit-magnitude output to the running residual sum. As that sum grows, each layer's output becomes a smaller fraction of the hidden state, which progressively dilutes each layer's relative contribution and buries early-layer information.
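The growth and dilution described above can be illustrated with a toy simulation. This is a minimal sketch, not a real transformer: the function name `simulate_prenorm_norms` and its parameters are hypothetical, and it assumes the worst case in which every layer emits the same unit-norm vector, so the hidden-state norm reaches the linear upper bound.

```python
import math

def simulate_prenorm_norms(num_layers, dim=8):
    """Toy residual stream: each layer adds a unit-norm output to the running sum.

    Returns (norms, shares), where norms[i] is the hidden-state norm after
    layer i+1 and shares[i] is that layer's output norm divided by it.
    """
    # Assumption (worst case): all layer outputs point in the same direction,
    # so the bound ||h_L|| <= ||h_0|| + L from the triangle inequality is attained.
    direction = [1.0 / math.sqrt(dim)] * dim  # a unit-norm vector
    h = list(direction)                       # h_0, also unit norm
    norms, shares = [], []
    for _ in range(num_layers):
        layer_out = direction                 # stand-in for a unit-magnitude layer output
        h = [a + b for a, b in zip(h, layer_out)]
        norm = math.sqrt(sum(x * x for x in h))
        norms.append(norm)
        shares.append(1.0 / norm)             # ||layer_out|| / ||h||, with ||layer_out|| = 1
    return norms, shares
```

In this setup the norm after l layers is l + 1, and the newest layer's share of the hidden state is 1 / (l + 1), so it shrinks steadily with depth. If layer outputs were uncorrelated rather than aligned, the norm would grow more slowly, roughly with the square root of depth; O(L) is the upper bound.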
See also: prenorm, residual_connections, attention_residuals