Chinchilla Scaling
The finding from Hoffmann et al. (2022) that, for a fixed compute budget, LLMs should be trained on roughly 20 tokens per parameter to be compute-optimal. This implies that many earlier large models (e.g. the 280B-parameter Gopher, trained on roughly 300B tokens) were significantly undertrained. Validates that the Attention Residuals improvements hold under this regime; a worked example of the token budget is sketched below.
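
A minimal sketch of the arithmetic behind the heuristic (not from this page; the C ≈ 6·N·D training-FLOPs estimate is the standard dense-transformer approximation, and the function names are illustrative):

  # Sketch of the ~20 tokens-per-parameter heuristic from Hoffmann et al. (2022).
  # The 6*N*D training-FLOPs estimate is the usual dense-transformer approximation.
  TOKENS_PER_PARAM = 20

  def optimal_tokens(n_params: float) -> float:
      # Approximate compute-optimal number of training tokens for n_params parameters.
      return TOKENS_PER_PARAM * n_params

  def training_flops(n_params: float, n_tokens: float) -> float:
      # Rough training compute: C ~= 6 * N * D FLOPs.
      return 6.0 * n_params * n_tokens

  # Example: a 70B-parameter model is compute-optimal at ~1.4T tokens (~5.9e23 FLOPs).
  n = 70e9
  d = optimal_tokens(n)
  print(f"{d:.2e} tokens, {training_flops(n, d):.2e} FLOPs")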
See also: scaling_laws, neural_scaling, attention_residuals
