Residual Connections

Skip connections that add a layer's input to its output: h_l = h_{l-1} + f(h_{l-1}). They enable gradient flow in deep networks, but unrolling gives h_L = h_0 + sum_l f(h_{l-1}): the stream accumulates every prior layer's output with fixed unit weights, so each individual contribution is diluted at depth.
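A minimal sketch of the accumulation property, using a toy tanh branch as a stand-in for an arbitrary layer function f (the branch form and dimensions are illustrative assumptions, not tied to any particular architecture):

```python
import numpy as np

def f(h, W):
    # toy residual branch: linear map + tanh, standing in for any layer
    return np.tanh(W @ h)

rng = np.random.default_rng(0)
d, depth = 4, 8
Ws = [rng.normal(scale=0.5, size=(d, d)) for _ in range(depth)]

h = rng.normal(size=d)
h0 = h.copy()
branch_sum = np.zeros(d)
for W in Ws:
    out = f(h, W)
    h = h + out          # h_l = h_{l-1} + f(h_{l-1})
    branch_sum += out    # track the sum of all branch outputs

# Unrolled form: final state = input + unit-weighted sum of every branch output.
assert np.allclose(h, h0 + branch_sum)
```

Because each branch output enters the sum with weight 1 and is never removed, any single layer's contribution to h_L shrinks relative to the total as depth grows.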

See also: attention_residuals, prenorm, gradient_highway, hidden_state_growth, layer_pruning