Fixed-Budget Change Point Identification in Piecewise Constant Bandits

Lazzaro, Joseph, Pike-Burke, Ciara

arXiv.org Machine Learning

We study the piecewise constant bandit problem where the expected reward is a piecewise constant function with one change point (discontinuity) across the action space $[0,1]$ and the learner's aim is to locate the change point. Under the assumption of a fixed exploration budget, we provide the first non-asymptotic analysis of policies designed to locate abrupt changes in the mean reward function under bandit feedback. We study the problem in both the large- and small-budget regimes, and for each setting we establish lower bounds on the error probability and provide algorithms with near-matching upper bounds. Interestingly, our results show a separation in the complexity of the two regimes. We then propose a regime-adaptive algorithm which is near optimal for both small and large budgets simultaneously. We complement our theoretical analysis with experimental results in simulated environments to support our findings.
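
The paper's policies and budget regimes are not spelled out in the abstract, so as a rough illustration of the feedback model only, here is a minimal fixed-budget baseline: a noisy binary search that splits the budget evenly across phases. The environment parameters (theta = 0.37, the two mean levels, unit Gaussian noise) and the search itself are our assumptions, not the authors' algorithm.

```python
import numpy as np

def reward(a, theta=0.37, mu_left=0.2, mu_right=0.8, noise=1.0, rng=None):
    """Noisy bandit feedback: the mean reward is piecewise constant on [0, 1] with one jump at theta."""
    rng = rng or np.random.default_rng()
    return (mu_left if a < theta else mu_right) + noise * rng.standard_normal()

def locate_change_point(budget, rounds=8, seed=0):
    """Noisy binary search under a fixed budget: estimate the two levels from the endpoints,
    then repeatedly test the midpoint of the remaining interval."""
    rng = np.random.default_rng(seed)
    per_phase = budget // (rounds + 2)
    mu0 = np.mean([reward(0.0, rng=rng) for _ in range(per_phase)])
    mu1 = np.mean([reward(1.0, rng=rng) for _ in range(per_phase)])
    lo, hi = 0.0, 1.0
    for _ in range(rounds):
        mid = (lo + hi) / 2
        m = np.mean([reward(mid, rng=rng) for _ in range(per_phase)])
        if abs(m - mu1) < abs(m - mu0):   # midpoint looks like the right-hand level:
            hi = mid                      # the change point is at or to the left of mid
        else:
            lo = mid
    return (lo + hi) / 2

print(locate_change_point(budget=5000))   # should land near theta = 0.37
```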


Stochastic Polyak Step-sizes and Momentum: Convergence Guarantees and Practical Performance

Oikonomou, Dimitris, Loizou, Nicolas

arXiv.org Machine Learning

Stochastic gradient descent with momentum, also known as the Stochastic Heavy Ball (SHB) method, is one of the most popular algorithms for solving large-scale stochastic optimization problems in various machine learning tasks. In practical scenarios, tuning the step-size and momentum parameters of the method is a prohibitively expensive and time-consuming process. In this work, inspired by the recent success of the stochastic Polyak step-size for stochastic gradient descent (SGD), we propose and explore new Polyak-type variants suitable for the update rule of the SHB method. In particular, using the Iterate Moving Average (IMA) viewpoint of SHB, we propose and analyze three novel step-size selections: MomSPS$_{\max}$, MomDecSPS, and MomAdaSPS. For MomSPS$_{\max}$, we provide convergence guarantees for SHB to a neighborhood of the solution for convex and smooth problems (without assuming interpolation). If interpolation is also satisfied, then using MomSPS$_{\max}$, SHB converges to the true solution at a fast rate matching the deterministic HB. The other two variants, MomDecSPS and MomAdaSPS, are the first adaptive step-sizes for SHB that guarantee convergence to the exact minimizer without prior knowledge of the problem parameters and without assuming interpolation. The convergence analysis of SHB is tight and recovers the convergence guarantees of SGD with stochastic Polyak step-sizes as a special case. We supplement our analysis with experiments that validate the theory and demonstrate the effectiveness and robustness of the new algorithms.
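
The exact MomSPS$_{\max}$, MomDecSPS, and MomAdaSPS rules are not given in the abstract, so the sketch below is only a hedged illustration: SHB with the classical capped stochastic Polyak step-size (SPS$_{\max}$) plugged into the gradient step, damped by $(1-\beta)$ as a conservative choice of ours rather than the paper's exact scaling. Function names and constants are assumptions.

```python
import numpy as np

def sps_max_step(fi_x, grad_i, fi_star=0.0, c=0.5, gamma_b=1.0):
    """Classical capped stochastic Polyak step-size (SPS_max), used here as a stand-in.
    The paper's MomSPS_max / MomDecSPS / MomAdaSPS rules adapt this idea to SHB."""
    g2 = float(np.dot(grad_i, grad_i))
    return gamma_b if g2 == 0.0 else min((fi_x - fi_star) / (c * g2), gamma_b)

def shb_with_polyak_step(X, y, iters=3000, beta=0.5, seed=0):
    """SHB: w_{k+1} = w_k - gamma_k * grad_i(w_k) + beta * (w_k - w_{k-1}),
    with a Polyak-type step; the (1 - beta) damping is our conservative choice."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_prev = w.copy()
    for _ in range(iters):
        i = rng.integers(n)
        resid = X[i] @ w - y[i]
        fi, grad = 0.5 * resid ** 2, resid * X[i]   # single-sample loss and gradient, f_i^* = 0
        gamma = (1 - beta) * sps_max_step(fi, grad)
        w, w_prev = w - gamma * grad + beta * (w - w_prev), w
    return w

# Realizable (interpolated) least squares: the method should recover w_true.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
w_true = rng.standard_normal(10)
y = X @ w_true
print(np.linalg.norm(shb_with_polyak_step(X, y) - w_true))
```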


Role of Momentum in Smoothing Objective Function in Implicit Graduated Optimization

Sato, Naoki, Iiduka, Hideaki

arXiv.org Artificial Intelligence

While stochastic gradient descent (SGD) with momentum has fast convergence and excellent generalizability, a theoretical explanation for this is lacking. In this paper, we show that SGD with momentum smooths the objective function, the degree of which is determined by the learning rate, the batch size, the momentum factor, the variance of the stochastic gradient, and the upper bound of the gradient norm. This theoretical finding reveals why momentum improves generalizability and provides new insights into the role of these hyperparameters.

Although convergence analyses of SGD with momentum for nonconvex functions have been provided [11, 16, 39], none of them explain why convergence is faster than with SGD. The generalizability of SGD with momentum has been well studied, and various experimental findings have been reported. While it has been suggested that momentum plays a role in reducing stochastic noise [8, 11], stochastic noise has been shown to increase generalizability [18, 37, 59], and it has been claimed that stochastic noise can help an algorithm escape from local solutions with poor generalizability [10, 15, 20, 28, 32].
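
For reference, the update analyzed here can be written with an explicit momentum buffer; that buffer is a geometrically weighted sum of past stochastic gradients, which is the averaging mechanism behind the smoothing interpretation. A minimal sketch in our own notation (the additive-noise gradient model is an assumption):

```python
import numpy as np

def shb(grad_fn, x0, lr=0.1, beta=0.9, steps=100, seed=0):
    """Heavy-ball / SGD with momentum, written two equivalent ways:
       (a) x_{k+1} = x_k - lr * g_k + beta * (x_k - x_{k-1})
       (b) m_k = beta * m_{k-1} + g_k ;  x_{k+1} = x_k - lr * m_k
    The buffer m_k = sum_j beta^{k-j} g_j geometrically averages past stochastic
    gradients, so each step follows a 'smoothed' descent direction."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x, rng)
        m = beta * m + g
        x = x - lr * m
    return x

# Noisy quadratic: gradient of 0.5*||x||^2 plus Gaussian noise (a stand-in for mini-batch noise).
print(shb(lambda x, rng: x + 0.1 * rng.standard_normal(x.shape), x0=np.ones(5)))
```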


Noise-adaptive (Accelerated) Stochastic Heavy-Ball Momentum

Dang, Anh, Babanezhad, Reza, Vaswani, Sharan

arXiv.org Artificial Intelligence

We analyze the convergence of stochastic heavy ball (SHB) momentum in the smooth, strongly-convex setting. Kidambi et al. (2018) show that SHB (with small mini-batches) cannot attain an accelerated rate of convergence even for quadratics, and conjecture that the practical gain of SHB is a by-product of mini-batching. We substantiate this claim by showing that SHB can obtain an accelerated rate when the mini-batch size is larger than some threshold. In particular, for strongly-convex quadratics with condition number $\kappa$, we prove that SHB with the standard step-size and momentum parameters results in an $O\left(\exp(-\frac{T}{\sqrt{\kappa}}) + \sigma \right)$ convergence rate, where $T$ is the number of iterations and $\sigma^2$ is the variance in the stochastic gradients. To ensure convergence to the minimizer, we propose a multi-stage approach that results in a noise-adaptive $O\left(\exp\left(-\frac{T}{\sqrt{\kappa}} \right) + \frac{\sigma}{T}\right)$ rate. For general strongly-convex functions, we use the averaging interpretation of SHB along with exponential step-sizes to prove an $O\left(\exp\left(-\frac{T}{\kappa} \right) + \frac{\sigma^2}{T} \right)$ convergence to the minimizer in a noise-adaptive manner. Finally, we empirically demonstrate the effectiveness of the proposed algorithms.
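
As a hedged illustration of the quadratic setting, the sketch below runs SHB with the standard (Polyak) parameters $\alpha = 4/(\sqrt{L}+\sqrt{\mu})^2$ and $\beta = ((\sqrt{\kappa}-1)/(\sqrt{\kappa}+1))^2$, and models mini-batching crudely as additive gradient noise of scale $\sigma/\sqrt{b}$; the paper's multi-stage, noise-adaptive schedule is not reproduced here, and the noise model is our assumption.

```python
import numpy as np

def shb_quadratic(eigs, batch_size, T=500, sigma=1.0, seed=0):
    """SHB on f(x) = 0.5 * sum_i eigs[i] * x_i^2 with Polyak's standard parameters.
    Mini-batching is modeled (crudely) as additive gradient noise of scale sigma/sqrt(batch_size)."""
    rng = np.random.default_rng(seed)
    L, mu = max(eigs), min(eigs)
    kappa = L / mu
    alpha = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2               # standard HB step-size
    beta = ((np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)) ** 2   # standard HB momentum
    x = np.ones_like(eigs, dtype=float)
    x_prev = x.copy()
    for _ in range(T):
        g = eigs * x + (sigma / np.sqrt(batch_size)) * rng.standard_normal(x.shape)
        x, x_prev = x - alpha * g + beta * (x - x_prev), x
    return np.linalg.norm(x)   # distance to the minimizer x* = 0

eigs = np.array([1.0, 10.0, 100.0])   # condition number kappa = 100
for b in (1, 64, 4096):
    print(b, shb_quadratic(eigs, batch_size=b))
```

Larger mini-batches shrink the $\sigma/\sqrt{b}$ noise floor while the bias contracts at the accelerated rate; per the abstract, the paper's multi-stage variant then removes the residual noise term to obtain the $O(\exp(-T/\sqrt{\kappa}) + \sigma/T)$ rate.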


Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise

Pan, Rui, Liu, Yuxing, Wang, Xiaoyu, Zhang, Tong

arXiv.org Artificial Intelligence

Heavy-ball momentum with decaying learning rates is widely used with SGD for optimizing deep learning models. Despite its empirical popularity, the understanding of its theoretical properties is still quite limited, especially under the standard anisotropic gradient noise condition for quadratic regression problems. Although it is widely conjectured that the heavy-ball momentum method can provide accelerated convergence and should work well in large-batch settings, there is no rigorous theoretical analysis. In this paper, we fill this theoretical gap by establishing a non-asymptotic convergence bound for stochastic heavy-ball methods with a step-decay scheduler on quadratic objectives, under the anisotropic gradient noise condition. As a direct implication, we show that heavy-ball momentum can provide $\tilde{\mathcal{O}}(\sqrt{\kappa})$ accelerated convergence of the bias term of SGD while still achieving a near-optimal convergence rate with respect to the stochastic variance term. The combined effect implies an overall convergence rate within log factors from the statistical minimax rate. This means SGD with heavy-ball momentum is useful in large-batch settings such as distributed machine learning or federated learning, where a smaller number of iterations can significantly reduce the number of communication rounds, leading to acceleration in practice.
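
The setting described here, heavy-ball momentum with a step-decay learning-rate schedule on a quadratic (least-squares) objective, can be sketched as below; the initial step-size, the halving schedule, and the noise level are illustrative assumptions of ours rather than the paper's constants.

```python
import numpy as np

def shb_step_decay(X, y, T=2000, beta=0.9, n_stages=5, seed=0):
    """SHB on least squares 0.5/n * ||Xw - y||^2 with a step-decay scheduler:
    the learning rate is halved after each of n_stages equal-length stages.
    The initial step below is a conservative choice of ours, not the paper's."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    lr = (1 - beta) / np.max(np.sum(X ** 2, axis=1))   # keeps the single-sample updates stable
    w = np.zeros(d)
    w_prev = w.copy()
    stage_len = T // n_stages
    for t in range(T):
        i = rng.integers(n)
        g = (X[i] @ w - y[i]) * X[i]                   # single-sample stochastic gradient
        w, w_prev = w - lr * g + beta * (w - w_prev), w
        if (t + 1) % stage_len == 0:
            lr *= 0.5                                  # step decay at stage boundaries
    return w

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 20))
w_true = rng.standard_normal(20)
y = X @ w_true + 0.1 * rng.standard_normal(500)        # quadratic regression with noise
print(np.linalg.norm(shb_step_decay(X, y) - w_true))
```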


Almost Sure Saddle Avoidance of Stochastic Gradient Methods without the Bounded Gradient Assumption

Liu, Jun, Yuan, Ye

arXiv.org Artificial Intelligence

We prove that various stochastic gradient methods, including stochastic gradient descent (SGD), the stochastic heavy-ball (SHB), and the stochastic Nesterov's accelerated gradient (SNAG) methods, almost surely avoid any strict saddle manifold. To the best of our knowledge, these are the first such results for the SHB and SNAG methods. Moreover, our analysis expands upon previous studies on SGD by removing the need for bounded gradients of the objective function and uniformly bounded noise. Instead, we introduce a more practical local boundedness assumption for the noisy gradient, which is naturally satisfied in empirical risk minimization problems typically seen in the training of neural networks. Keywords: Stochastic gradient descent, stochastic heavy-ball, stochastic Nesterov's accelerated gradient, almost sure saddle avoidance
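
As a toy check of the claim (a demonstration, not the proof), the snippet below starts SGD, SHB, and SNAG on the stable manifold of the strict saddle of $f(x,y) = x^2 - y^2$ and adds small gradient noise; all three escape, consistent with almost-sure saddle avoidance. The step-size, momentum, and noise scale are arbitrary choices of ours.

```python
import numpy as np

def run(method, steps=100, lr=0.05, beta=0.9, noise=1e-3, seed=0):
    """f(x, y) = x^2 - y^2 has a strict saddle at the origin (grad = (2x, -2y)).
    Start on the saddle's stable manifold (y = 0) with small gradient noise."""
    rng = np.random.default_rng(seed)

    def grad(p):
        return np.array([2 * p[0], -2 * p[1]]) + noise * rng.standard_normal(2)

    z = np.array([1.0, 0.0])
    z_prev = z.copy()
    v = np.zeros(2)                       # velocity buffer for SNAG
    for _ in range(steps):
        if method == "sgd":
            z = z - lr * grad(z)
        elif method == "shb":             # z_{k+1} = z_k - lr*g(z_k) + beta*(z_k - z_{k-1})
            z, z_prev = z - lr * grad(z) + beta * (z - z_prev), z
        elif method == "snag":            # gradient taken at the lookahead point z + beta*v
            v = beta * v - lr * grad(z + beta * v)
            z = z + v
    return z

for m in ("sgd", "shb", "snag"):
    print(m, run(m))                      # the y-coordinate grows: all three leave the saddle
```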


Mean-field analysis for heavy ball methods: Dropout-stability, connectivity, and global convergence

Wu, Diyuan, Kungurtsev, Vyacheslav, Mondelli, Marco

arXiv.org Artificial Intelligence

The stochastic heavy ball method (SHB), also known as stochastic gradient descent (SGD) with Polyak's momentum, is widely used in training neural networks. However, despite the remarkable success of this algorithm in practice, its theoretical characterization remains limited. In this paper, we focus on neural networks with two and three layers and provide a rigorous understanding of the properties of the solutions found by SHB: \emph{(i)} stability after dropping out part of the neurons, \emph{(ii)} connectivity along a low-loss path, and \emph{(iii)} convergence to the global optimum. To achieve this goal, we take a mean-field view and relate the SHB dynamics to a certain partial differential equation in the limit of large network widths. This mean-field perspective has inspired a recent line of work focusing on SGD while, in contrast, our paper considers an algorithm with momentum. More specifically, after proving existence and uniqueness of the limit differential equations, we show convergence to the global optimum and give a quantitative bound between the mean-field limit and the SHB dynamics of a finite-width network. Armed with this last bound, we are able to establish the dropout-stability and connectivity of SHB solutions.
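
Dropout-stability (property \emph{(i)}) is easy to state concretely: in the mean-field scaling $f(x) = \frac{1}{N}\sum_j a_j \tanh(w_j \cdot x)$, removing a random half of the neurons should barely change the network output once the width is large. The sketch below only measures that gap at random parameters to illustrate the definition; the paper's contribution is showing the gap stays small for the parameters found by SHB, which is not reproduced here.

```python
import numpy as np

def mean_field_net(W, a, X):
    """Two-layer network in mean-field scaling: f(x) = (1/N) * sum_j a_j * tanh(w_j . x)."""
    return np.tanh(X @ W.T) @ a / len(a)

def dropout_gap(W, a, X, keep_frac=0.5, rng=None):
    """Largest change in the outputs after keeping a random keep_frac of the neurons;
    the 1/N factor rescales automatically because we average over the kept neurons."""
    rng = rng or np.random.default_rng(0)
    keep = rng.permutation(len(a))[: int(keep_frac * len(a))]
    full = mean_field_net(W, a, X)
    dropped = mean_field_net(W[keep], a[keep], X)
    return np.max(np.abs(full - dropped))

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
for N in (100, 1000, 10000):
    W, a = rng.standard_normal((N, 5)), rng.standard_normal(N)
    print(N, dropout_gap(W, a, X))        # gap shrinks roughly like 1/sqrt(N) as the width grows
```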


On the convergence of the Stochastic Heavy Ball Method

Sebbouh, Othmane, Gower, Robert M., Defazio, Aaron

arXiv.org Machine Learning

We provide a comprehensive analysis of the Stochastic Heavy Ball (SHB) method (otherwise known as the momentum method), including a convergence analysis of the last iterate of SHB, establishing a faster rate of convergence than existing bounds on the last iterate of Stochastic Gradient Descent (SGD) in the convex setting. Our analysis shows that unlike SGD, no final iterate averaging is necessary with the SHB method. We detail new iteration-dependent step sizes (learning rates) and momentum parameters for SHB that result in this fast convergence. Moreover, assuming only smoothness and convexity, we prove that the iterates of SHB converge \textit{almost surely} to a minimizer, and that the convergence of the function values of (S)HB is asymptotically faster than that of (S)GD in the overparametrized and in the deterministic settings. Our analysis is general, in that it includes all forms of mini-batching and non-uniform sampling as special cases, using an arbitrary sampling framework. Furthermore, our analysis does not rely on the bounded gradient assumption. Instead, it only relies on smoothness, which is an assumption that can be more readily verified. Finally, we present extensive numerical experiments that show that our theoretically motivated parameter settings give statistically significantly faster convergence across a diverse collection of datasets.
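
One way to see why no extra iterate averaging is needed is the reparameterization of heavy ball as an online (moving) average of an auxiliary SGD-like sequence. The sketch below checks that reparameterization numerically for constant parameters; the paper's specific iteration-dependent step-size and momentum schedules are not reproduced, and the constants $\lambda = \beta/(1-\beta)$ and $\eta = \alpha/(1-\beta)$ are our own algebra.

```python
import numpy as np

rng = np.random.default_rng(0)
d, steps = 4, 50
grads = rng.standard_normal((steps, d))        # pre-drawn 'stochastic gradients', shared by both runs
alpha, beta = 0.1, 0.8                         # SHB step-size and momentum

# Form 1: heavy ball, x_{k+1} = x_k - alpha*g_k + beta*(x_k - x_{k-1})
x = np.zeros(d)
x_prev = x.copy()
for g in grads:
    x, x_prev = x - alpha * g + beta * (x - x_prev), x

# Form 2: moving average of an auxiliary SGD-like sequence z,
#         z_{k+1} = z_k - eta * g_k,  x_{k+1} = (lam * x_k + z_{k+1}) / (1 + lam),
#         with lam = beta / (1 - beta) and eta = alpha / (1 - beta).
lam, eta = beta / (1 - beta), alpha / (1 - beta)
xa = np.zeros(d)
z = np.zeros(d)
for g in grads:
    z = z - eta * g
    xa = (lam * xa + z) / (1 + lam)

print(np.max(np.abs(x - xa)))                  # ~1e-16: the two forms coincide
```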


Rapidly Adapting Moment Estimation

Zhang, Guoqiang, Niwa, Kenta, Kleijn, W. Bastiaan

arXiv.org Machine Learning

Adaptive gradient methods such as Adam have been shown to be very effective for training deep neural networks (DNNs) by tracking the second moment of gradients to compute the individual learning rates. Unlike existing methods, we make use of the most recent first moment of gradients to compute the individual learning rates per iteration. The motivation is that the dynamic variation of the first moment of gradients may provide useful information to obtain the learning rates. We refer to the new method as rapidly adapting moment estimation (RAME). The theoretical convergence of deterministic RAME is studied by using an analysis similar to the one used in [1] for Adam. Experimental results for training a number of DNNs show promising performance of RAME in terms of convergence speed and generalization compared to the stochastic heavy-ball (SHB) method, Adam, and RMSprop.


A Unified Analysis of Stochastic Momentum Methods for Deep Learning

Yan, Yan, Yang, Tianbao, Li, Zhe, Lin, Qihang, Yang, Yi

arXiv.org Machine Learning

Stochastic momentum methods have been widely adopted in training deep neural networks. However, the theoretical analysis of their convergence on the training objective and of their generalization error for prediction is still under-explored. This paper aims to bridge the gap between practice and theory by analyzing the stochastic gradient (SG) method and two well-known stochastic momentum variants, namely the stochastic heavy-ball (SHB) method and the stochastic variant of Nesterov's accelerated gradient (SNAG) method. We propose a framework that unifies the three variants. We then derive the convergence rates of the gradient norm for the non-convex optimization problem, and analyze the generalization performance through the uniform stability approach. In particular, the convergence analysis of the training objective shows that SHB and SNAG have no advantage over SG. However, the stability analysis shows that the momentum term can improve the stability of the learned model and hence improve the generalization performance. These theoretical insights verify the common wisdom and are also corroborated by our empirical analysis on deep learning.
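
Up to notation, the unified template for SG, SHB, and SNAG used in this line of work can be written with a single interpolation parameter $s$; the sketch below is our reconstruction (the paper's exact parameterization may differ): $s = 0$ recovers the heavy-ball update, $s = 1$ the Nesterov update, and $\beta = 0$ plain SG.

```python
import numpy as np

def unified_momentum(grad, x0, lr=0.1, beta=0.9, s=0.0, steps=200):
    """One template for SG / SHB / SNAG (a reconstruction, not verbatim from the paper):
         y_{k+1}  = x_k - lr * g_k
         ys_{k+1} = x_k - s * lr * g_k
         x_{k+1}  = y_{k+1} + beta * (ys_{k+1} - ys_k)
    s = 0 gives the stochastic heavy-ball update, s = 1 gives stochastic Nesterov,
    and beta = 0 reduces everything to plain SG."""
    x = np.asarray(x0, dtype=float).copy()
    ys = x.copy()
    for _ in range(steps):
        g = grad(x)
        y_next = x - lr * g
        ys_next = x - s * lr * g
        x, ys = y_next + beta * (ys_next - ys), ys_next
    return x

grad_quad = lambda x: x                     # gradient of 0.5*||x||^2
for name, (beta, s) in {"sg": (0.0, 0.0), "shb": (0.9, 0.0), "snag": (0.9, 1.0)}.items():
    print(name, np.linalg.norm(unified_momentum(grad_quad, np.ones(3), beta=beta, s=s)))
```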