The alignment property of SGD noise and how it helps select flat minima: A stability analysis

Oct-10-2024, 05:05:15 GMT–Neural Information Processing Systems

The tendency of stochastic gradient descent (SGD) to favor flat minima has played a critical role in understanding the implicit regularization of SGD. In this paper, we explain this striking phenomenon by relating the particular noise structure of SGD to its linear stability (Wu et al., 2018). Specifically, we consider training over-parameterized models with square loss. We prove that if a global minimum θ* is linearly stable for SGD, then the Frobenius norm of the Hessian at θ* is at most O(√B/η), where B and η denote the batch size and learning rate, respectively. Otherwise, SGD will escape from that minimum exponentially fast. Hence, for minima accessible to SGD, the sharpness, as measured by the Frobenius norm of the Hessian, is bounded independently of the model size and sample size.
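As a rough illustration of what linear stability means in this setting, the sketch below (not the paper's code) runs SGD on an over-parameterized linear regression with square loss, starting right next to a zero-loss global minimum. The synthetic data, the dimensions, the learning rate, the two batch sizes, and the helper names `square_loss` and `run_sgd` are all assumptions made for this example. With the same learning rate, the full-batch run stays near the minimum while the single-sample run is driven away from it.

```python
import numpy as np

def square_loss(X, y, theta):
    """Square loss 1/(2n) * sum_i (x_i . theta - y_i)^2."""
    residual = X @ theta - y
    return 0.5 * np.mean(residual ** 2)

def run_sgd(X, y, theta0, lr, batch_size, steps, seed=0):
    """Mini-batch SGD on the square loss; returns the loss after every step."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    theta = theta0.copy()
    losses = [square_loss(X, y, theta)]
    for _ in range(steps):
        # Sample a mini-batch without replacement (batch_size == n is full batch).
        idx = rng.choice(n, size=batch_size, replace=False)
        residual = X[idx] @ theta - y[idx]
        grad = X[idx].T @ residual / batch_size
        theta -= lr * grad
        losses.append(square_loss(X, y, theta))
    return np.array(losses)

# Illustrative synthetic problem (assumed values, not from the paper).
rng = np.random.default_rng(0)
n, d = 20, 50                        # over-parameterized: more parameters than samples
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
y = X @ theta_star                   # theta_star is a global minimum with zero loss

# For the square loss of a linear model the Hessian does not depend on theta.
hessian = X.T @ X / n
sharpness = np.linalg.norm(hessian, "fro")   # Frobenius-norm sharpness

# Start from a small perturbation of the minimum and compare two batch sizes.
theta0 = theta_star + 1e-3 * rng.standard_normal(d)
lr = 0.2
full_batch = run_sgd(X, y, theta0, lr, batch_size=n, steps=100)
single_sample = run_sgd(X, y, theta0, lr, batch_size=1, steps=100)

print(f"Frobenius norm of the Hessian: {sharpness:.2f}")
print(f"full batch   : loss {full_batch[0]:.2e} -> {full_batch[-1]:.2e}")
print(f"batch size 1 : loss {single_sample[0]:.2e} -> {single_sample[-1]:.2e}")
```

The same minimum, with the same Hessian, is therefore stable or unstable depending on the noise injected by mini-batching. The example only shows this qualitatively and does not attempt to reproduce the paper's constants or experiments.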

alignment property, SGD noise, stability analysis

Technology:
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.62)