A.2 Derivations for Section 3.1

We begin with a formal derivation of the formulas in Section 3.1. Recall that we consider a function F(θ) whose parameters can be split into n scale-invariant (SI) groups: θ = (θ1, ..., θn). We solve an optimization problem (1) with projected gradient descent (2).

Remark 2. The above formulation allegedly lacks the third (divergent) regime. If, conversely, η > 1 / (∑_{i=1}^n α_i), then at each iteration at least one of the individual ELRs exceeds its convergence threshold: η_i > 1/α_i.
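The projected-gradient setup above can be sketched numerically. Below is a minimal sketch of projected gradient descent on the unit sphere with a fixed ELR; the toy objective F(θ) = −(uᵀθ)² / ‖θ‖² is our own illustrative choice (not necessarily the paper's example), picked because it is scale invariant and minimized at θ = ±u.

```python
import numpy as np

def project(theta):
    """Project parameters back onto the unit sphere."""
    return theta / np.linalg.norm(theta)

def pgd_on_sphere(grad_f, theta0, elr, steps):
    """Projected gradient descent with a fixed effective learning rate (ELR).

    Each step moves along the component of the gradient tangent to the
    sphere, then re-projects onto the unit sphere.
    """
    theta = project(theta0)
    for _ in range(steps):
        g = grad_f(theta)
        g_tan = g - theta * (theta @ g)  # keep only the tangential component
        theta = project(theta - elr * g_tan)
    return theta

# Toy scale-invariant objective (illustrative, not from the paper):
# on the unit sphere, F(theta) = -(u . theta)^2, minimized at theta = ±u.
u = np.zeros(5)
u[0] = 1.0
grad_f = lambda th: -2.0 * (u @ th) * u

theta_star = pgd_on_sphere(grad_f, np.ones(5), elr=0.1, steps=500)
print(abs(theta_star[0]))  # approaches 1 in the convergence regime
```

With a small fixed ELR this lands in the convergence regime; pushing the ELR up is how the chaotic-equilibrium and divergent regimes discussed in the paper are probed.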
Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes
A fundamental property of deep learning normalization techniques, such as batch normalization, is making the pre-normalization parameters scale invariant. The intrinsic domain of such parameters is the unit sphere, and therefore their gradient optimization dynamics can be represented via spherical optimization with varying effective learning rate (ELR), which was studied previously. However, the varying ELR may obscure certain characteristics of the intrinsic loss landscape structure. In this work, we investigate the properties of training scale-invariant neural networks directly on the sphere using a fixed ELR. We discover three regimes of such training depending on the ELR value: convergence, chaotic equilibrium, and divergence. We study these regimes in detail both through a theoretical examination of a toy example and through a thorough empirical analysis of real scale-invariant deep learning models. Each regime has unique features and reflects specific properties of the intrinsic loss landscape, some of which have strong parallels with previous research on both regular and scale-invariant neural network training. Finally, we demonstrate how the discovered regimes are reflected in conventional training of normalized networks and how they can be leveraged to achieve better optima.
Efficient Hyperparameter Tuning via Trajectory Invariance Principle
Bingrui Li, Jiaxin Wen, Zhanpeng Zhou, Jun Zhu, Jianfei Chen
As hyperparameter tuning becomes increasingly costly at scale, efficient tuning methods are essential. Yet principles for guiding hyperparameter tuning remain limited. In this work, we seek to establish such principles by considering a broad range of hyperparameters, including batch size, learning rate, and weight decay. We identify a phenomenon we call trajectory invariance, where pre-training loss curves, gradient noise, and gradient norm exhibit invariance--closely overlapping--with respect to a quantity that combines learning rate and weight decay. This phenomenon effectively reduces the original two-dimensional hyperparameter space to one dimension, yielding an efficient tuning rule: follow the salient direction revealed by trajectory invariance. Furthermore, we refine previous scaling laws and challenge several existing viewpoints. Overall, our work proposes new principles for efficient tuning and inspires future research on scaling laws.
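The reduction from two hyperparameter dimensions to one can be illustrated mechanically. In the sketch below we assume, purely for illustration, that the combined quantity is the product lr × wd (the paper's actual invariant quantity may differ): grouping a 2-D (learning-rate, weight-decay) grid by this quantity leaves one representative run per distinct value, which is then tuned along the remaining one-dimensional direction.

```python
import itertools

# Hypothetical invariant: assume trajectories depend on lr and wd only
# through k = lr * wd (an illustrative assumption, not the paper's
# exact definition of the combined quantity).
def invariant_key(lr, wd):
    return round(lr * wd, 10)  # round away floating-point noise

lrs = [1e-3, 3e-3, 1e-2]
wds = [1e-2, 3e-2, 1e-1]

groups = {}
for lr, wd in itertools.product(lrs, wds):
    groups.setdefault(invariant_key(lr, wd), []).append((lr, wd))

# Instead of len(lrs) * len(wds) runs over the full grid, tune one
# run per distinct invariant value.
for k in sorted(groups):
    print(f"k = {k:.0e}: equivalent configs {groups[k]}")
```

Here the 3 × 3 = 9 grid collapses to 6 distinct invariant values, and the tuning rule amounts to sweeping along those values rather than the full grid.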
A Theory

In this section, we provide proofs and additional details for Section 3.

A.1 Norm constraint: total vs. individual

We begin with a formal derivation of the formulas in Section 3.1.

A.4 More formally on the results of Section 3.2

In this section, we provide a more formal argument on the results of Section 3.2, building on the results of Section 3.1 and solving the problem with the projected gradient method. Here we also provide additional plots depicting the behavior of individual ELRs in the toy example at the end of Section 3.2.