The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines and Drifting Towards Wide Minima
Bartlett, Peter L., Long, Philip M., Bousquet, Olivier
arXiv.org Artificial Intelligence
The broad practical impact of deep learning has heightened interest in many of its surprising characteristics: simple gradient methods applied to deep neural networks seem to efficiently optimize nonconvex criteria, reliably giving a near-perfect fit to training data, yet exhibiting good predictive accuracy nonetheless [see Bartlett et al., 2021]. Optimization methodology is widely believed to affect statistical performance by imposing some kind of implicit regularization, and there has been considerable effort devoted to understanding the behavior of optimization methods and the nature of the solutions that they find. For instance, Barrett and Dherin [2020] and Smith et al. [2021] show that discrete-time gradient descent and stochastic gradient descent can be viewed as gradient flow methods applied to penalized losses that encourage smoothness, and Soudry et al. [2018] and Azulay et al. [2021] identify the implicit regularization imposed by gradient flow in specific examples, including linear networks. We consider Sharpness-Aware Minimization (SAM), a recently introduced [Foret et al., 2021] gradient optimization method that has exhibited substantial improvements in prediction performance for deep networks applied to image classification [Foret et al., 2021] and NLP [Bahri et al., 2022] problems.
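For readers unfamiliar with SAM, the update rule of Foret et al. [2021] first perturbs the parameters in the (normalized) gradient direction to a point of approximately highest loss within a ball of radius rho, then applies a descent step using the gradient evaluated at that perturbed point. The sketch below is a minimal illustration of one such step, assuming a generic `loss_grad` function supplied by the user; the function and parameter names are illustrative, not from the paper.

```python
import numpy as np

def sam_step(w, loss_grad, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization (SAM) step, following Foret et al. (2021).

    w         : current parameter vector (np.ndarray)
    loss_grad : function mapping parameters to the gradient of the training loss
    lr        : learning rate for the descent step
    rho       : radius of the adversarial (ascent) perturbation
    """
    g = loss_grad(w)
    # Ascent step: move to the approximate worst point in an l2 ball of radius rho.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Descent step: apply the gradient evaluated at the perturbed parameters.
    return w - lr * loss_grad(w + eps)

# Example usage on a simple quadratic loss L(w) = 0.5 * ||w||^2 (hypothetical):
w = np.array([1.0, -2.0])
w_next = sam_step(w, loss_grad=lambda v: v)
```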
Apr-11-2023