The position-wise MLP applied after the attention sublayer in each Transformer block, acting independently on each token position. It typically consists of two linear transformations with a nonlinearity in between: FFN(x) = W2 · act(W1 · x + b1) + b2, where the inner dimension is usually several times the model dimension (e.g. 4× in the original Transformer). In MoE architectures, the single FFN is replaced by multiple expert FFNs, with a router selecting which experts process each token.
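A minimal NumPy sketch of the two-layer FFN described above, assuming a GELU activation (tanh approximation) and an illustrative 4× expansion of the hidden dimension; dimensions and initialization are arbitrary choices for the example:

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU nonlinearity
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    # Position-wise: the same weights are applied to every token vector.
    return gelu(x @ W1 + b1) @ W2 + b2

d_model, d_ff = 8, 32          # illustrative: d_ff = 4 * d_model
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

tokens = rng.standard_normal((5, d_model))   # 5 token positions
out = ffn(tokens, W1, b1, W2, b2)
# Output dimension matches the input, so it can be added to the residual stream.
assert out.shape == tokens.shape
```

Because the FFN is applied per position, the token axis can be any length (or batched) without changing the weights, which is also what makes per-token expert routing in MoE layers possible.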
See also: transformer, moe, residual_connections