The alignment property of SGD noise and how it helps select flat minima: A stability analysis
Neural Information Processing Systems
The phenomenon that stochastic gradient descent (SGD) favors flat minima has played a critical role in understanding the implicit regularization of SGD. In this paper, we provide an explanation of this striking phenomenon by relating the particular noise structure of SGD to its linear stability (Wu et al., 2018). Specifically, we consider training over-parameterized models with square loss.