Goto

Collaborating Authors

 heavy-tailed gradient noise


First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

Neural Information Processing Systems

Stochastic gradient descent (SGD) has been widely used in machine learning due to its computational efficiency and favorable generalization properties. Recently, it has been empirically demonstrated that the gradient noise in several deep learning settings admits a non-Gaussian, heavy-tailed behavior. This suggests that the gradient noise can be modeled by using $\alpha$-stable distributions, a family of heavy-tailed distributions that appear in the generalized central limit theorem. In this context, SGD can be viewed as a discretization of a stochastic differential equation (SDE) driven by a L\'{e}vy motion, and the metastability results for this SDE can then be used for illuminating the behavior of SGD, especially in terms of `preferring wide minima'. While this approach brings a new perspective for analyzing SGD, it is limited in the sense that, due to the time discretization, SGD might admit a significantly different behavior than its continuous-time limit. Intuitively, the behaviors of these two systems are expected to be similar to each other only when the discretization step is sufficiently small; however, to the best of our knowledge, there is no theoretical understanding on how small the step-size should be chosen in order to guarantee that the discretized system inherits the properties of the continuous-time system. In this study, we provide formal theoretical analysis where we derive explicit conditions for the step-size such that the metastability behavior of the discrete-time system is similar to its continuous-time limit. We show that the behaviors of the two systems are indeed similar for small step-sizes and we identify how the error depends on the algorithm and problem parameters. We illustrate our results with simulations on a synthetic model and neural networks.


Reviews: First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

Neural Information Processing Systems

For Reviewer #1's concern about making theory, I tend to be open-minded since I can not find solid evidence that the paper is making theory only. For Reviewer #4's comment about the over-claim of the result the paper proved, my take is follows. First, for many problems, the true local minima enjoys the flat basin. A famous example I have is the following paper: McGoff, Kevin A., et al. "The Local Edge Machine: inference of dynamic models of gene regulation." Second, the authors have explained the motivation of using the Levy process to model the noise.


Reviews: First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

Neural Information Processing Systems

The reviewers liked the paper and appreciated the authors feedback. The authors should implement all the recommendations from the reviewers in the final version of the paper.


First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

Neural Information Processing Systems

Stochastic gradient descent (SGD) has been widely used in machine learning due to its computational efficiency and favorable generalization properties. Recently, it has been empirically demonstrated that the gradient noise in several deep learning settings admits a non-Gaussian, heavy-tailed behavior. This suggests that the gradient noise can be modeled by using \alpha -stable distributions, a family of heavy-tailed distributions that appear in the generalized central limit theorem. In this context, SGD can be viewed as a discretization of a stochastic differential equation (SDE) driven by a L\'{e}vy motion, and the metastability results for this SDE can then be used for illuminating the behavior of SGD, especially in terms of preferring wide minima'. While this approach brings a new perspective for analyzing SGD, it is limited in the sense that, due to the time discretization, SGD might admit a significantly different behavior than its continuous-time limit.


First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

Neural Information Processing Systems

Stochastic gradient descent (SGD) has been widely used in machine learning due to its computational efficiency and favorable generalization properties. Recently, it has been empirically demonstrated that the gradient noise in several deep learning settings admits a non-Gaussian, heavy-tailed behavior. This suggests that the gradient noise can be modeled by using $\alpha$-stable distributions, a family of heavy-tailed distributions that appear in the generalized central limit theorem. In this context, SGD can be viewed as a discretization of a stochastic differential equation (SDE) driven by a L\'{e}vy motion, and the metastability results for this SDE can then be used for illuminating the behavior of SGD, especially in terms of preferring wide minima'. While this approach brings a new perspective for analyzing SGD, it is limited in the sense that, due to the time discretization, SGD might admit a significantly different behavior than its continuous-time limit.