A Rod Flow Model for Adam at the Edge of Stability

Eric Regis, Sinho Chewi

arXiv.org Machine Learning 

Neural networks are trained by minimizing loss functions with gradient-based optimizers. Cohen et al. [2021] observed that full-batch gradient descent operates at the edge of stability (EoS): the largest eigenvalue of the Hessian, called the sharpness, first rises (a phase called progressive sharpening) and then hovers at the stability threshold 2/η, where η is the learning rate. Cohen et al. [2022] extended this picture to momentum methods and adaptive gradient methods, showing that each optimizer exhibits its own edge of stability. Rather than hovering at 2/η, the relevant quantity, the preconditioned sharpness, hovers at a threshold determined by the optimizer and its hyperparameters (Table 2). In practice, the dominant optimizer in machine learning is Adam [Kingma and Ba, 2015], which differs from gradient descent in two respects.