
Chinchilla Scaling

The finding from Hoffmann et al. (2022) that, for compute-optimal training, LLMs should be trained on roughly 20 tokens per parameter. This implies that many large models were significantly undertrained for their size (e.g. Gopher, with 280B parameters trained on 300B tokens). It also validates that Attention Residuals improvements hold under this regime.
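The rule of thumb can be expressed numerically. A minimal sketch, using the common approximation C ≈ 6·N·D FLOPs for dense transformer training (N parameters, D tokens); the helper names here are illustrative, not from any particular library:

```python
TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb (Hoffmann et al., 2022)

def optimal_tokens(n_params: int) -> int:
    """Approximate compute-optimal training tokens for a model of n_params parameters."""
    return TOKENS_PER_PARAM * n_params

def training_flops(n_params: int, n_tokens: int) -> float:
    """Standard estimate of dense-transformer training compute: C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens

# Chinchilla itself: 70B parameters -> ~1.4T tokens
n = 70_000_000_000
d = optimal_tokens(n)
print(d)                     # 1.4 trillion tokens
print(training_flops(n, d))  # roughly 5.9e23 FLOPs
```

These numbers match the Chinchilla model itself (70B parameters, ~1.4T tokens), which was trained with roughly the same compute budget as the 280B-parameter Gopher.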

See also: scaling_laws, neural_scaling, attention_residuals

concepts/chinchilla_scaling.txt · Last modified: by aethersync
