AITopics | heavy-tailed gradient noise

First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

Neural Information Processing Systems, Dec-25-2025, 20:40:31 GMT

Stochastic gradient descent (SGD) has been widely used in machine learning due to its computational efficiency and favorable generalization properties. Recently, it has been empirically demonstrated that the gradient noise in several deep learning settings exhibits non-Gaussian, heavy-tailed behavior. This suggests that the gradient noise can be modeled using $\alpha$-stable distributions, a family of heavy-tailed distributions that appear in the generalized central limit theorem. In this context, SGD can be viewed as a discretization of a stochastic differential equation (SDE) driven by a Lévy motion, and the metastability results for this SDE can then be used to illuminate the behavior of SGD, especially in terms of 'preferring wide minima'. While this approach brings a new perspective for analyzing SGD, it is limited in the sense that, due to the time discretization, SGD might behave significantly differently from its continuous-time limit. Intuitively, the behaviors of these two systems are expected to be similar only when the discretization step is sufficiently small; however, to the best of our knowledge, there is no theoretical understanding of how small the step-size should be chosen in order to guarantee that the discretized system inherits the properties of the continuous-time system. In this study, we provide a formal theoretical analysis in which we derive explicit conditions for the step-size such that the metastability behavior of the discrete-time system is similar to that of its continuous-time limit. We show that the behaviors of the two systems are indeed similar for small step-sizes and we identify how the error depends on the algorithm and problem parameters. We illustrate our results with simulations on a synthetic model and neural networks.
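To make the setting in the abstract concrete, the following is a minimal sketch of a one-dimensional SGD-like recursion perturbed by symmetric $\alpha$-stable noise, together with a first exit time measured in iterations. Everything here is an illustrative assumption rather than the authors' code: the function names, the Euler-type update w <- w - eta * grad(w) + eps * eta^(1/alpha) * S (the eta^(1/alpha) scaling is the usual one for discretizing an $\alpha$-stable Lévy motion), the quadratic objective in the demo, and the Chambers-Mallows-Stuck sampler used to draw S.

```python
import math
import random


def sample_symmetric_alpha_stable(alpha, rng):
    """Draw one standard symmetric alpha-stable variate.

    Uses the Chambers-Mallows-Stuck construction for the symmetric case.
    alpha = 2 gives a Gaussian with variance 2; alpha = 1 gives a Cauchy.
    """
    if not 0.0 < alpha <= 2.0:
        raise ValueError("alpha must lie in (0, 2]")
    u = rng.uniform(-math.pi / 2, math.pi / 2)
    # Guard against the (measure-zero) event of an exactly zero exponential draw.
    w = max(rng.expovariate(1.0), 1e-300)
    if alpha == 1.0:
        return math.tan(u)
    return (math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
            * (math.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))


def first_exit_time(grad, w0, radius, eta, eps, alpha, max_iters, rng):
    """Return the first iteration k with |w_k - w0| > radius, or None.

    Assumed update (hypothetical, for illustration):
        w_{k+1} = w_k - eta * grad(w_k) + eps * eta**(1/alpha) * S_k
    where S_k are i.i.d. standard symmetric alpha-stable draws.
    """
    w = w0
    scale = eps * eta ** (1.0 / alpha)
    for k in range(1, max_iters + 1):
        w = w - eta * grad(w) + scale * sample_symmetric_alpha_stable(alpha, rng)
        if abs(w - w0) > radius:
            return k
    return None


if __name__ == "__main__":
    # Demo on the assumed objective f(w) = w**2 / 2, whose minimum is at w = 0.
    rng = random.Random(0)
    for eta in (0.1, 0.01):
        k = first_exit_time(lambda w: w, 0.0, 1.0, eta, 0.5, 1.5, 200_000, rng)
        # k * eta is the exit time on the continuous-time scale (if an exit occurred).
        print(eta, k, None if k is None else k * eta)
```

Comparing k * eta across step-sizes is one informal way to see the question the abstract raises: whether the exit behavior of the discrete recursion approaches that of the continuous-time process as eta shrinks. A single run is very noisy, so averaging over many seeds would be needed for any meaningful comparison.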

exit time analysis, heavy-tailed gradient noise, stochastic gradient descent, (5 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.58)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.63)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.58)


Reviews: First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

Neural Information Processing Systems, Jan-26-2025, 08:24:01 GMT

Regarding Reviewer #1's concern about making theory, I tend to be open-minded, since I cannot find solid evidence that the paper is making theory only. Regarding Reviewer #4's comment about over-claiming the result the paper proved, my take is as follows. First, for many problems, the true local minimum enjoys a flat basin. A famous example I have is the following paper: McGoff, Kevin A., et al. "The Local Edge Machine: inference of dynamic models of gene regulation." Second, the authors have explained the motivation for using the Lévy process to model the noise.

exit time analysis, heavy-tailed gradient noise, stochastic gradient descent, (6 more...)

Neural Information Processing Systems

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.38)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.85)


Reviews: First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

Neural Information Processing Systems, Jan-26-2025, 08:23:51 GMT

The reviewers liked the paper and appreciated the authors' feedback. The authors should implement all of the reviewers' recommendations in the final version of the paper.

exit time analysis, heavy-tailed gradient noise, stochastic gradient descent, (1 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.85)


First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

Neural Information Processing Systems, Oct-10-2024, 17:07:04 GMT

Stochastic gradient descent (SGD) has been widely used in machine learning due to its computational efficiency and favorable generalization properties. Recently, it has been empirically demonstrated that the gradient noise in several deep learning settings exhibits non-Gaussian, heavy-tailed behavior. This suggests that the gradient noise can be modeled using $\alpha$-stable distributions, a family of heavy-tailed distributions that appear in the generalized central limit theorem. In this context, SGD can be viewed as a discretization of a stochastic differential equation (SDE) driven by a Lévy motion, and the metastability results for this SDE can then be used to illuminate the behavior of SGD, especially in terms of 'preferring wide minima'. While this approach brings a new perspective for analyzing SGD, it is limited in the sense that, due to the time discretization, SGD might behave significantly differently from its continuous-time limit.

exit time analysis, heavy-tailed gradient noise, stochastic gradient descent, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)


First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

Nguyen, Thanh Huy, Simsekli, Umut, Gurbuzbalaban, Mert, RICHARD, Gaël

Neural Information Processing Systems, Mar-18-2020, 20:30:32 GMT

Stochastic gradient descent (SGD) has been widely used in machine learning due to its computational efficiency and favorable generalization properties. Recently, it has been empirically demonstrated that the gradient noise in several deep learning settings exhibits non-Gaussian, heavy-tailed behavior. This suggests that the gradient noise can be modeled using $\alpha$-stable distributions, a family of heavy-tailed distributions that appear in the generalized central limit theorem. In this context, SGD can be viewed as a discretization of a stochastic differential equation (SDE) driven by a Lévy motion, and the metastability results for this SDE can then be used to illuminate the behavior of SGD, especially in terms of 'preferring wide minima'. While this approach brings a new perspective for analyzing SGD, it is limited in the sense that, due to the time discretization, SGD might behave significantly differently from its continuous-time limit.

exit time analysis, heavy-tailed gradient noise, stochastic gradient descent, (3 more...)

Neural Information Processing Systems

Genre: Research Report (0.37)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

