


5aea56eefab60e06f35016478e21aae6-Supplemental-Conference.pdf

Neural Information Processing Systems

A.2 Derivations for Section 3.1. We begin with a formal derivation of the formulas in Section 3.1. Recall that we consider a function F(θ) whose parameters can be split into n SI groups: θ = (θ1, …, θn). We solve optimization problem (1) with projected gradient descent (2). Remark 2. The above formulation allegedly lacks the third (divergent) regime. If, conversely, η > 1/(α1 + … + αn), then at each iteration at least one of the individual ELRs exceeds its convergence threshold: ηi > 1/αi. Indeed, since ηi = η/‖θi‖² and the total norm constraint gives ‖θ1‖² + … + ‖θn‖² = 1, having ηi ≤ 1/αi for every i would force η ≤ 1/(α1 + … + αn).
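The remark is essentially a pigeonhole argument and is easy to check numerically. Below is a minimal sketch assuming the ELR convention ηi = η/‖θi‖² and the total norm constraint; the αi values and the Dirichlet-sampled norm splits are illustrative choices, not the paper's toy example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pigeonhole check behind Remark 2, assuming the ELR convention
# eta_i = eta / ||theta_i||^2 and the total norm constraint
# ||theta_1||^2 + ... + ||theta_n||^2 = 1. The alpha_i values and the
# random norm splits are illustrative, not the paper's toy example.
n = 4
alpha = np.array([0.5, 1.0, 2.0, 4.0])    # per-group threshold constants
eta = 1.05 / alpha.sum()                  # just above the threshold 1/sum(alpha_i)

for trial in range(3):
    sq_norms = rng.dirichlet(np.ones(n))  # random split with sum ||theta_i||^2 = 1
    eta_i = eta / sq_norms                # individual ELRs
    over = eta_i > 1.0 / alpha            # groups with eta_i > 1/alpha_i
    assert over.any()                     # guaranteed whenever eta > 1/sum(alpha_i)
    print(f"trial {trial}: groups over threshold: {np.flatnonzero(over)}")
```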




Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes

Neural Information Processing Systems

A fundamental property of deep learning normalization techniques, such as batch normalization, is making the pre-normalization parameters scale invariant. The intrinsic domain of such parameters is the unit sphere, and therefore their gradient optimization dynamics can be represented via spherical optimization with varying effective learning rate (ELR), which was studied previously. However, the varying ELR may obscure certain characteristics of the intrinsic loss landscape structure. In this work, we investigate the properties of training scale-invariant neural networks directly on the sphere using a fixed ELR. We discover three regimes of such training depending on the ELR value: convergence, chaotic equilibrium, and divergence. We study these regimes in detail both on a theoretical examination of a toy example and on a thorough empirical analysis of real scale-invariant deep learning models. Each regime has unique features and reflects specific properties of the intrinsic loss landscape, some of which have strong parallels with previous research on both regular and scale-invariant neural networks training. Finally, we demonstrate how the discovered regimes are reflected in conventional training of normalized networks and how they can be leveraged to achieve better optima.
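For intuition, the sketch below runs fixed-ELR gradient descent on the sphere for a simple scale-invariant loss. The loss F(θ) = -(u·θ)²/‖θ‖², the dimension, and the three ELR values are our own illustrative choices; the regime boundaries in the paper depend on the actual landscape.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative scale-invariant loss: F(theta) = -(u . theta)^2 / ||theta||^2,
# whose intrinsic domain is the unit sphere (minima at theta = +/-u).
d = 50
u = rng.standard_normal(d)
u /= np.linalg.norm(u)

def sphere_step(theta, eta):
    """One fixed-ELR gradient step on the sphere: take the tangential part
    of the gradient, step, then renormalize back onto the sphere."""
    g = -2.0 * (u @ theta) * u           # Euclidean gradient at unit-norm theta
    g_tan = g - (g @ theta) * theta      # tangent-space projection
    theta = theta - eta * g_tan
    return theta / np.linalg.norm(theta)

for eta in (0.05, 1.5, 50.0):            # illustrative ELRs, roughly one per regime
    theta = rng.standard_normal(d)
    theta /= np.linalg.norm(theta)
    for _ in range(500):
        theta = sphere_step(theta, eta)
    print(f"ELR {eta:>5}: final loss {-(u @ theta) ** 2:+.4f}")
```

In this toy run, the small ELR settles near the minimum, the intermediate value bounces around an equilibrium level, and the very large value keeps the iterates at an essentially random-level loss, qualitatively matching the three regimes.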


Efficient Hyperparameter Tuning via Trajectory Invariance Principle

Li, Bingrui, Wen, Jiaxin, Zhou, Zhanpeng, Zhu, Jun, Chen, Jianfei

arXiv.org Artificial Intelligence

As hyperparameter tuning becomes increasingly costly at scale, efficient tuning methods are essential. Yet principles for guiding hyperparameter tuning remain limited. In this work, we seek to establish such principles by considering a broad range of hyperparameters, including batch size, learning rate, and weight decay. We identify a phenomenon we call trajectory invariance, where pre-training loss curves, gradient noise, and gradient norm exhibit invariance (i.e., closely overlap) with respect to a quantity that combines learning rate and weight decay. This phenomenon effectively reduces the original two-dimensional hyperparameter space to one dimension, yielding an efficient tuning rule: follow the salient direction revealed by trajectory invariance. Furthermore, we refine previous scaling laws and challenge several existing viewpoints. Overall, our work proposes new principles for efficient tuning and inspires future research on scaling laws.
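The practical upshot is that a 2-D (learning rate, weight decay) grid collapses to a 1-D search along the invariant quantity. The abstract does not spell out that quantity, so the sketch below uses the product lr × wd purely as a hypothetical stand-in for the paper's actual combination.

```python
import itertools
from collections import defaultdict

# Hypothetical tuning rule implied by trajectory invariance: if trajectories
# depend on (lr, wd) mainly through one combined quantity, run only one
# representative per value of that quantity. The product lr * wd is an
# illustrative stand-in, not the paper's definition.
lrs = [1e-4, 3e-4, 1e-3, 3e-3]
wds = [0.01, 0.03, 0.1, 0.3]

groups = defaultdict(list)
for lr, wd in itertools.product(lrs, wds):
    groups[round(lr * wd, 12)].append((lr, wd))  # bucket by the combined quantity

reps = [configs[0] for _, configs in sorted(groups.items())]
print(f"{len(lrs) * len(wds)} grid points -> {len(reps)} representative runs")
for lr, wd in reps:
    print(f"  lr={lr:g}, wd={wd:g}, lr*wd={lr * wd:g}")
```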



A Theory

Neural Information Processing Systems

In this section, we provide proofs and additional details for Section 3. A.1 Norm constraint: total vs. individual. We begin with a formal derivation of the formulas in Section 3.1. Then the following results hold: 1. η < … The above formulation allegedly lacks the third (divergent) regime; the second statement follows from eq. (A.4). More formally on the results of Section 3.2. In this section, we provide a more formal argument for the results of Section 3.2. According to the results of Section 3.1, we solve it with the projected gradient method. Here we provide additional plots depicting the behavior of individual ELRs in the toy example at the end of Section 3.2.
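Since the appendix refers to plots of individual ELRs, here is a minimal sketch of how one might monitor them during conventional training, assuming the usual convention ηi = η/‖θi‖² for a scale-invariant group; the tiny batch-normalized model and the random data are placeholders, not the paper's setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Minimal monitoring sketch (placeholder model/data): the weight of a linear
# layer followed by batch normalization is scale-invariant, so its individual
# ELR is commonly taken as eta_i = eta / ||theta_i||^2.
model = nn.Sequential(
    nn.Linear(32, 64, bias=False),  # pre-BN weights: scale-invariant group
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 1),
)
eta = 0.1
opt = torch.optim.SGD(model.parameters(), lr=eta)
x, y = torch.randn(128, 32), torch.randn(128, 1)

for step in range(101):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    if step % 25 == 0:
        w = model[0].weight
        print(f"step {step:3d}: individual ELR = {eta / w.norm().item() ** 2:.4f}")
```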


