How SGD Selects the Global Minima in Over-parameterized Learning: A Dynamical Stability Perspective
The question of which global minima are accessible to a stochastic gradient descent (SGD) algorithm with a specific learning rate and batch size is studied from the perspective of dynamical stability. The concept of non-uniformity is introduced, which, together with sharpness, characterizes the stability property of a global minimum and hence the accessibility of a particular SGD algorithm to that global minimum. In particular, the analysis shows that learning rate and batch size play different roles in minima selection. Extensive empirical results correlate well with the theoretical findings and provide further support for these claims.
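For intuition, here is a minimal numerical sketch (not the authors' code) of the kind of stability criterion at play, using the one-dimensional diagonal model f(x) = (1/n) sum_i (1/2) a_i x^2; the constants n, B, eta and the curvature distribution below are purely illustrative, and the bracketed factor is a direct computation for this toy model.

import numpy as np

# Second-moment stability of SGD at x = 0 for the 1-D model
#   f(x) = (1/n) * sum_i (1/2) * a_i * x^2,
# with sharpness a = mean(a_i) and non-uniformity s = std(a_i).
# For minibatches of size B drawn without replacement, one SGD step gives
#   E[x'^2] = [(1 - eta*a)^2 + eta^2 * s^2 * (n - B) / (B*(n - 1))] * x^2,
# so the minimum is stable in second moment iff the bracketed factor is <= 1.

rng = np.random.default_rng(0)
n, B, eta = 100, 10, 0.05                      # illustrative values
a_i = rng.uniform(0.5, 3.0, size=n)            # per-example curvatures
a, s = a_i.mean(), a_i.std()                   # sharpness, non-uniformity

factor = (1 - eta * a) ** 2 + eta**2 * s**2 * (n - B) / (B * (n - 1))
print(f"sharpness a = {a:.3f}, non-uniformity s = {s:.3f}")
print(f"per-step factor = {factor:.4f} ({'stable' if factor <= 1 else 'unstable'})")

# Monte Carlo check: run SGD from x_0 = 1 many times and compare the
# empirical per-step contraction of E[x_t^2] with the predicted factor.
T, runs = 200, 2000
x = np.ones(runs)
for _ in range(T):
    batch_means = np.array(
        [a_i[rng.choice(n, size=B, replace=False)].mean() for _ in range(runs)]
    )
    x = (1 - eta * batch_means) * x
print(f"empirical factor = {np.mean(x ** 2) ** (1 / T):.4f}")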
Reviews: How SGD Selects the Global Minima in Over-parameterized Learning: A Dynamical Stability Perspective
This paper examines, theoretically and empirically, how SGD selects the global minimum it converges to. It first defines two properties ("sharpness" and "non-uniformity") of a fixed point and shows how these, together with the batch size, determine the maximal learning rate at which the fixed point is stable under the SGD dynamics (both in mean and in variance). It then demonstrates numerically how the learning rate and batch size affect the selection of minima and the dynamics of "escape" from sharp minima. Clarity: This paper is nicely written and quite clear. Quality: Seems correct, except for some fixable errors (see below), and the numerical results seem reasonably convincing. Originality: The results are novel to the best of my knowledge. Significance: The results shed light on the connections between sharpness, learning rate, and batch size, and highlight the importance of "non-uniformity". These connections are not well understood and have received attention since ...
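To make the "maximal learning rate" concrete, the toy criterion sketched above can be solved in closed form for the largest stable eta as a function of batch size; the sketch below (again illustrative, with hypothetical constants rather than the paper's experimental settings) shows that full-batch gradient descent recovers the classical bound 2/a, while smaller batches tighten the bound through the non-uniformity s, which is one sense in which learning rate and batch size play different roles.

import numpy as np

# Solving (1 - eta*a)^2 + eta^2 * s^2 * c = 1 for eta > 0,
# with c = (n - B) / (B * (n - 1)), gives
#   eta_max = 2*a / (a^2 + s^2 * c).
# At B = n the variance term vanishes (c = 0) and eta_max = 2/a.

def eta_max(a, s, n, B):
    c = (n - B) / (B * (n - 1))
    return 2 * a / (a**2 + s**2 * c)

a, s, n = 1.75, 0.72, 100      # sharpness, non-uniformity (illustrative)
for B in (1, 10, 50, 100):
    print(f"B = {B:3d}: eta_max = {eta_max(a, s, n, B):.4f}")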