A network architecture that processes sequences by maintaining a hidden state that is updated at each time step. Attention Residuals admits a recurrent interpretation: softmax attention over all prior layer outputs can be viewed as a weighted recurrence over depth, connecting transformer depth to recurrent computation.
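A minimal sketch of this recurrent view, using NumPy: each "layer step" forms a softmax-weighted combination of all prior layer outputs, so depth plays the role of time in an RNN. The dimensions, scoring rule, and initialization here are illustrative assumptions, not a specific published architecture.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 4            # hidden size (illustrative assumption)
num_layers = 5   # depth of the recurrence (illustrative assumption)

outputs = [rng.standard_normal(d)]   # layer-0 output seeds the recurrence
for t in range(1, num_layers):
    prior = np.stack(outputs)             # (t, d): all prior layer outputs
    query = outputs[-1]                   # current state acts as the query
    scores = prior @ query / np.sqrt(d)   # scaled dot-product scores
    weights = softmax(scores)             # attention over prior layers
    state = weights @ prior               # weighted recurrence: new hidden state
    outputs.append(state)

print(len(outputs), outputs[-1].shape)
```

Here the hidden state at each depth is a convex combination of every earlier layer's output, which is the "weighted recurrence" reading of attention over the residual stream.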
See also: attention_residuals, softmax_attention, gradient_highway