Gemstones: A Model Suite for Multi-Faceted Scaling Laws
Sean McLeish, John Kirchenbauer, David Yu Miller, Siddharth Singh, Abhinav Bhatele, Micah Goldblum, Ashwinee Panda, Tom Goldstein
Scaling laws are typically fit using a family of models with a narrow range of frozen hyperparameter choices. In this work we study scaling laws using a wide range of architecture and hyperparameter choices, and highlight their impact on resulting prescriptions. As a primary artifact of our research, we release the Gemstones: the most comprehensive open-source scaling law dataset to date, consisting of over 4000 checkpoints from transformers with up to 2 billion parameters; these models have been trained with different learning rates, cooldown schedules, and architectural shapes. Our models, called the Gemstones because they are loosely based on scaled-down variants of the Gemma architecture, vary in their parameter count, width/depth ratio, training tokens, learning rates, and cooldown schedules. By fitting scaling laws to these checkpoints, we confirm that scaling law parameters and interpretations indeed depend strongly on the selection of models and fitting procedure used, and we quantify the degree to which these decisions impact predictions. By exploiting the variation among our model checkpoints, we also fit a number of unique scaling laws and analyze their predictions to discern whether they are consistent with design choices we see in industry models.
arXiv.org Artificial Intelligence
Feb-7-2025
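
For readers who want a concrete picture of what "fitting a scaling law to checkpoints" involves, below is a minimal Python sketch. It assumes the standard Chinchilla-style parametric form L(N, D) = E + A/N^alpha + B/D^beta fit with a Huber loss in log space; the paper fits this family and richer width/depth-aware variants, so everything here (the synthetic data, the variable names, the optimizer setup) is illustrative rather than the authors' actual code.

```python
# Sketch: fitting a Chinchilla-style scaling law to (params, tokens, loss)
# checkpoint records. The functional form and Huber-in-log-space objective
# follow Hoffmann et al. (2022); the data below is synthetic/hypothetical.
import numpy as np
from scipy.optimize import minimize

def predicted_loss(theta, N, D):
    # Parameterize A, B, E in log space for numerical stability.
    log_A, log_B, log_E, alpha, beta = theta
    return np.exp(log_E) + np.exp(log_A) / N**alpha + np.exp(log_B) / D**beta

def objective(theta, N, D, loss, delta=1e-3):
    # Huber loss on log-residuals damps the influence of outlier checkpoints.
    resid = np.log(predicted_loss(theta, N, D)) - np.log(loss)
    quad = np.minimum(np.abs(resid), delta)
    return np.sum(0.5 * quad**2 + delta * (np.abs(resid) - quad))

# Hypothetical checkpoints: parameter count N, training tokens D, val loss.
N = np.array([5e7, 1e8, 5e8, 1e9, 2e9])
D = np.array([1e9, 5e9, 2e10, 5e10, 1e11])
loss = np.array([3.9, 3.4, 2.9, 2.6, 2.4])

# The fit is nonconvex, so sweep a small grid of initializations and keep
# the best result rather than trusting a single starting point.
inits = [np.array([5.0, 5.0, 0.0, 0.5, 0.5]),
         np.array([10.0, 10.0, -1.0, 0.3, 0.3])]
best = min(
    (minimize(objective, x0, args=(N, D, loss), method="L-BFGS-B")
     for x0 in inits),
    key=lambda r: r.fun,
)
print("fitted (log A, log B, log E, alpha, beta):", best.x)
```

The paper's central point maps directly onto this sketch: the fitted exponents, and hence any compute-optimal prescription derived from them, can shift substantially depending on which checkpoints populate N, D, and loss and on choices like the objective and initialization grid.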