Information-Theoretic Foundations for Neural Scaling Laws
Jeon, Hong Jun, Van Roy, Benjamin
–arXiv.org Artificial Intelligence
In recent years, foundation models have grown immensely, with some embodying trillions of trainable parameters. While larger models have in general produced better results, they also require much more compute to train. It has become impractical to perform hyperparameter sweeps at the scale of these modern models. This has required bypassing the practice of tuning hyperparameters via extensive trial and error, as was previously common in deep learning. Among other things, hyperparameters control 1) the size, measured in terms of the parameter count p, of the neural network model and 2) the number T of training tokens. If each parameter is adjusted in response to each token, then the computational requirements of training scale with the product of these two quantities. For any compute budget C, one should carefully balance between p and T. Too few training tokens lead to model estimation error, while too few parameters give rise to misspecification error. As evaluating performance across multiple choices of p and T becomes computationally prohibitive at scale, alternative kinds of analysis are required to guide allocation of computational resources. Kaplan et al. [2020] and Hoffmann et al. [2022] have proposed the following procedure for allocating a large compute budget: 1) Evaluate test errors of models produced using various small compute budgets C with many different allocations to parameters p versus training tokens T. 2) Extrapolate to estimate the relation between p and T for large C. To give a sense of the scales involved here, Hoffmann et al. [2022] evaluate test errors across "small" models for which pT ranges from around 10
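The compute-balancing logic described above lends itself to a small numerical illustration. The sketch below is not from the paper: it assumes the commonly used estimate that training compute scales as roughly 6pT FLOPs, and uses made-up small-budget "measurements" to fit and extrapolate a power law for the compute-optimal parameter count, in the spirit of the Kaplan et al. / Hoffmann et al. procedure.

```python
# Illustrative sketch only. The 6*p*T FLOP estimate and the small-budget
# (C, p*) pairs below are assumptions for demonstration, not results from
# Jeon and Van Roy or from Kaplan et al. / Hoffmann et al.
import numpy as np

def train_flops(p: float, T: float) -> float:
    """Approximate training compute when every parameter is updated on every
    token; 6*p*T is a common forward-plus-backward FLOP estimate."""
    return 6.0 * p * T

# Hypothetical compute-optimal parameter counts found at small budgets.
small_C = np.array([1e18, 3e18, 1e19, 3e19, 1e20])   # small compute budgets (FLOPs)
opt_p   = np.array([2.0e8, 3.3e8, 5.8e8, 9.6e8, 1.7e9])  # best p found per budget

# Fit a power law p*(C) = k * C^a in log space, then extrapolate to a large C.
a, log_k = np.polyfit(np.log(small_C), np.log(opt_p), 1)

def optimal_p(C: float) -> float:
    """Extrapolated compute-optimal parameter count for budget C."""
    return float(np.exp(log_k) * C ** a)

C_large = 1e24                        # target "large" compute budget
p_star = optimal_p(C_large)           # extrapolated parameter count
T_star = C_large / (6.0 * p_star)     # token count implied by C ~ 6*p*T
print(f"extrapolated p* ~ {p_star:.3g}, implied T ~ {T_star:.3g}")
```

The fit in log space reflects the power-law form that both cited works report for the compute-optimal frontier; the specific exponent and constant here depend entirely on the synthetic inputs.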
Jun-27-2024