A large language model (LLM) is a neural network trained on large text corpora to model language, typically by predicting the next token autoregressively. Modern LLMs use transformer architectures with pre-norm layer normalization and residual connections, and increasingly mixture-of-experts (MoE) layers for compute efficiency. Empirical scaling laws describe how their loss improves with model size, data, and compute.
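The pre-norm transformer block mentioned above can be sketched as a toy single-head layer in NumPy; the parameter names (`w_qkv`, `w_o`, `w1`, `w2`) and shapes are illustrative assumptions, with untrained random weights and no batching or masking:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's features to zero mean, unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, w_qkv, w_o):
    # Single-head scaled dot-product attention over a (tokens, dim) matrix.
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return scores @ v @ w_o

def mlp(x, w1, w2):
    # Position-wise feed-forward network with a ReLU nonlinearity.
    return np.maximum(x @ w1, 0.0) @ w2

def prenorm_block(x, params):
    # PreNorm: normalize *before* each sub-layer, then add the residual.
    x = x + attention(layer_norm(x), params["w_qkv"], params["w_o"])
    x = x + mlp(layer_norm(x), params["w1"], params["w2"])
    return x
```

Because each sub-layer's output is added back onto `x`, the residual stream passes through the block unnormalized, which is the property that makes pre-norm stacks stable to train at depth.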
See also: softmax_attention, moe, scaling_laws, prenorm