Chinchilla Scaling
The finding from Hoffmann et al. (2022) that, for a fixed compute budget, LLMs should be trained on roughly 20 tokens per parameter to be compute-optimal. This implies that many earlier large models (e.g. the 280B-parameter Gopher, trained on roughly 300B tokens) were significantly undertrained. Validates that the Attention Residuals improvements hold under this regime; a worked example of the token budget is sketched below.
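
A minimal sketch of the arithmetic behind the heuristic (not from this page; the C ≈ 6·N·D training-FLOPs estimate is the standard dense-transformer approximation, and the function names are illustrative):

  # Sketch of the ~20 tokens-per-parameter heuristic from Hoffmann et al. (2022).
  # The 6*N*D training-FLOPs estimate is the usual dense-transformer approximation.
  TOKENS_PER_PARAM = 20

  def optimal_tokens(n_params: float) -> float:
      # Approximate compute-optimal number of training tokens for n_params parameters.
      return TOKENS_PER_PARAM * n_params

  def training_flops(n_params: float, n_tokens: float) -> float:
      # Rough training compute: C ~= 6 * N * D FLOPs.
      return 6.0 * n_params * n_tokens

  # Example: a 70B-parameter model is compute-optimal at ~1.4T tokens (~5.9e23 FLOPs).
  n = 70e9
  d = optimal_tokens(n)
  print(f"{d:.2e} tokens, {training_flops(n, d):.2e} FLOPs")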
See also: scaling_laws, neural_scaling, attention_residuals
