====== Chinchilla Scaling ======

The finding from Hoffmann et al. (2022) that [[concepts:llm|LLMs]] should be trained on roughly 20 tokens per parameter for compute-optimal training. This implies that many large models were trained on far fewer tokens than their compute budget warrants, i.e. they are significantly undertrained. It also validates that the improvements from [[papers:attention_residuals|Attention Residuals]] hold under the compute-optimal regime.

See also: [[concepts:scaling_laws]], [[concepts:neural_scaling]], [[papers:attention_residuals]]
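A minimal sketch of the heuristic (the helper names below are ours, not from Hoffmann et al.): the compute-optimal token budget is roughly D ≈ 20N for a model with N parameters, and total training compute is commonly approximated as C ≈ 6ND FLOPs.

<code python>
# Hypothetical helpers illustrating the Chinchilla heuristic; the function
# names are illustrative, not from the paper or any library.

def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal training-token budget: D ~= 20 * N."""
    return tokens_per_param * n_params

def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard approximation of training compute: C ~= 6 * N * D FLOPs."""
    return 6.0 * n_params * n_tokens

if __name__ == "__main__":
    n = 70e9                                  # 70B parameters (Chinchilla's size)
    d = chinchilla_optimal_tokens(n)          # ~1.4e12 tokens
    print(f"optimal tokens: {d:.2e}")         # 1.40e+12
    print(f"training FLOPs: {training_flops(n, d):.2e}")  # 5.88e+23
</code>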