learning rate phase
Implicit bias of deep linear networks in the large learning rate phase
Huang, Wei, Du, Weitao, Da Xu, Richard Yi, Liu, Chunrui
Most theoretical studies explaining the regularization effect in deep learning have only focused on gradient descent with a sufficient small learning rate or even gradient flow (infinitesimal learning rate). Such researches, however, have neglected a reasonably large learning rate applied in most practical applications. In this work, we characterize the implicit bias effect of deep linear networks for binary classification using the logistic loss in the large learning rate regime, inspired by the seminal work by Lewkowycz et al. [26] in a regression setting with squared loss. They found a learning rate regime with a large stepsize named the catapult phase, where the loss grows at the early stage of training and eventually converges to a minimum that is flatter than those found in the small learning rate regime. We claim that depending on the separation conditions of data, the gradient descent iterates will converge to a flatter minimum in the catapult phase. We rigorously prove this claim under the assumption of degenerate data by overcoming the difficulty of the non-constant Hessian of logistic loss and further characterize the behavior of loss and Hessian for non-separable data. Finally, we demonstrate that flatter minima in the space spanned by non-separable data along with the learning rate in the catapult phase can lead to better generalization empirically.
- Oceania > Australia > New South Wales > Sydney (0.04)
- North America > United States (0.04)
The large learning rate phase of deep learning: the catapult mechanism
Lewkowycz, Aitor, Bahri, Yasaman, Dyer, Ethan, Sohl-Dickstein, Jascha, Gur-Ari, Guy
The choice of initial learning rate can have a profound effect on the performance of deep networks. We present a class of neural networks with solvable training dynamics, and confirm their predictions empirically in practical deep learning settings. The networks exhibit sharply distinct behaviors at small and large learning rates. The two regimes are separated by a phase transition. In the small learning rate phase, training can be understood using the existing theory of infinitely wide neural networks. At large learning rates the model captures qualitatively distinct phenomena, including the convergence of gradient descent dynamics to flatter minima. One key prediction of our model is a narrow range of large, stable learning rates. We find good agreement between our model's predictions and training dynamics in realistic deep learning settings. Furthermore, we find that the optimal performance in such settings is often found in the large learning rate phase. We believe our results shed light on characteristics of models trained at different learning rates. In particular, they fill a gap between existing wide neural network theory, and the nonlinear, large learning rate, training dynamics relevant to practice.