Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit

Blake Bordelon, Lorenzo Noci, Mufan Bill Li, Boris Hanin, Cengiz Pehlevan

arXiv.org Machine Learning 

The cost of hyperparameter tuning in deep learning has been rising with model size, prompting practitioners to seek new tuning methods that use smaller networks as a proxy. One such proposal uses µP-parameterized networks, in which the optimal hyperparameters found for small-width networks transfer to networks of arbitrarily large width. In this scheme, however, hyperparameters do not transfer across depth. As a remedy, we study residual networks with a residual branch scale of 1/√depth in combination with the µP parameterization. We provide experiments demonstrating that residual architectures, including convolutional ResNets and Vision Transformers, trained with this parameterization exhibit transfer of optimal hyperparameters across width and depth on CIFAR-10 and ImageNet. Furthermore, our empirical findings are supported and motivated by theory. Using recent developments in the dynamical mean field theory (DMFT) description of neural network learning dynamics, we show that this parameterization of ResNets admits a well-defined joint infinite-width and infinite-depth feature-learning limit, and we show convergence of finite-size network dynamics towards this limit.

Increasing the number of parameters in a neural network has led to consistent and often dramatic improvements in model quality (Kaplan et al., 2020; Hoffmann et al., 2022; Zhai et al., 2022; Klug & Heckel, 2023; OpenAI, 2023). To realize these gains, however, it is typically necessary to conduct a trial-and-error grid search for optimal choices of hyperparameters, such as learning rates. To combat this, an influential recent line of work by Yang & Hu (2021) proposed the so-called µP parameterization, which seeks to develop principles by which optimal hyperparameters from small networks can be reused -- or transferred -- to larger networks (Yang et al., 2021).

(Figure caption: Examples are taken after 20 epochs on CIFAR-10; runs exceeding a target loss of 0.5 are removed from the plot for visual clarity, and missing datapoints indicate that the corresponding run diverged.)
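To make the parameterization concrete, below is a minimal sketch of a residual network whose residual branches are scaled by 1/√depth, combined with a µP-style zero-initialized readout and 1/width output multiplier. The class name, the MLP backbone, and the omission of µP's per-layer learning-rate scaling are simplifications for illustration, not the authors' implementation; the paper's experiments apply the same branch scaling inside convolutional ResNet and Vision Transformer blocks.

```python
import math
import torch
import torch.nn as nn


class DepthMuPResidualMLP(nn.Module):
    """Sketch: residual MLP with a 1/sqrt(depth) residual-branch scale
    and a muP-style readout (zero init, 1/width output multiplier)."""

    def __init__(self, in_dim: int, width: int, depth: int, out_dim: int):
        super().__init__()
        self.width = width
        self.depth = depth
        self.embed = nn.Linear(in_dim, width, bias=False)
        self.blocks = nn.ModuleList(
            [nn.Linear(width, width, bias=False) for _ in range(depth)]
        )
        # muP readout: zero-initialized; the 1/width multiplier is applied in forward().
        self.readout = nn.Linear(width, out_dim, bias=False)
        nn.init.zeros_(self.readout.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.embed(x)
        # Each residual branch is down-weighted by 1/sqrt(depth).
        branch_scale = 1.0 / math.sqrt(self.depth)
        for block in self.blocks:
            h = h + branch_scale * torch.relu(block(h))
        # 1/width output multiplier, as in mean-field / muP readout scaling.
        return self.readout(h) / self.width
```

Under this kind of setup, one would tune the learning rate on a small-width, small-depth model and reuse it at larger width and depth, which is the transfer behavior the paper demonstrates empirically.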
