shb
Fixed-Budget Change Point Identification in Piecewise Constant Bandits
Lazzaro, Joseph, Pike-Burke, Ciara
We study the piecewise constant bandit problem where the expected reward is a piecewise constant function with one change point (discontinuity) across the action space $[0,1]$ and the learner's aim is to locate the change point. Under the assumption of a fixed exploration budget, we provide the first non-asymptotic analysis of policies designed to locate abrupt changes in the mean reward function under bandit feedback. We study the problem under a large and small budget regime, and for both settings establish lower bounds on the error probability and provide algorithms with near matching upper bounds. Interestingly, our results show a separation in the complexity of the two regimes. We then propose a regime adaptive algorithm which is near optimal for both small and large budgets simultaneously. We complement our theoretical analysis with experimental results in simulated environments to support our findings.
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Data Science > Data Mining > Big Data (0.67)
Stochastic Polyak Step-sizes and Momentum: Convergence Guarantees and Practical Performance
Oikonomou, Dimitris, Loizou, Nicolas
Stochastic gradient descent with momentum, also known as Stochastic Heavy Ball method (SHB), is one of the most popular algorithms for solving large-scale stochastic optimization problems in various machine learning tasks. In practical scenarios, tuning the step-size and momentum parameters of the method is a prohibitively expensive and time-consuming process. In this work, inspired by the recent advantages of stochastic Polyak step-size in the performance of stochastic gradient descent (SGD), we propose and explore new Polyak-type variants suitable for the update rule of the SHB method. In particular, using the Iterate Moving Average (IMA) viewpoint of SHB, we propose and analyze three novel step-size selections: MomSPS$_{\max}$, MomDecSPS, and MomAdaSPS. For MomSPS$_{\max}$, we provide convergence guarantees for SHB to a neighborhood of the solution for convex and smooth problems (without assuming interpolation). If interpolation is also satisfied, then using MomSPS$_{\max}$, SHB converges to the true solution at a fast rate matching the deterministic HB. The other two variants, MomDecSPS and MomAdaSPS, are the first adaptive step-sizes for SHB that guarantee convergence to the exact minimizer without prior knowledge of the problem parameters and without assuming interpolation. The convergence analysis of SHB is tight and obtains the convergence guarantees of SGD with stochastic Polyak step-sizes as a special case. We supplement our analysis with experiments that validate the theory and demonstrate the effectiveness and robustness of the new algorithms.
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > New York (0.04)
- Europe > Russia (0.04)
- Asia > Russia (0.04)
Role of Momentum in Smoothing Objective Function in Implicit Graduated Optimization
Although convergence analyses of SGD with momentum for nonconvex functions have been provided While stochastic gradient descent (SGD) with momentum [11, 16, 39], none of them explain why convergence has fast convergence and excellent generalizability, a is faster than with SGD. The generalizability of SGD with theoretical explanation for this is lacking. In this paper, momentum has been well studied, and various experimental we show that SGD with momentum smooths the objective findings have been reported. While it has been suggested function, the degree of which is determined by the learning that momentum plays a role in reducing stochastic noise rate, the batch size, the momentum factor, the variance of [8, 11], stochastic noise has been shown to increase generalizability the stochastic gradient, and the upper bound of the gradient [18, 37, 59], and it has been claimed that stochastic norm. This theoretical finding reveals why momentum improves noise can help an algorithm escape from local solutions generalizability and provides new insights into the with poor generalizability [10, 15, 20, 28, 32].
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.76)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Noise-adaptive (Accelerated) Stochastic Heavy-Ball Momentum
Dang, Anh, Babanezhad, Reza, Vaswani, Sharan
We analyze the convergence of stochastic heavy ball (SHB) momentum in the smooth, strongly-convex setting. Kidambi et al. (2018) show that SHB (with small mini-batches) cannot attain an accelerated rate of convergence even for quadratics, and conjecture that the practical gain of SHB is a by-product of mini-batching. We substantiate this claim by showing that SHB can obtain an accelerated rate when the mini-batch size is larger than some threshold. In particular, for strongly-convex quadratics with condition number $\kappa$, we prove that SHB with the standard step-size and momentum parameters results in an $O\left(\exp(-\frac{T}{\sqrt{\kappa}}) + \sigma \right)$ convergence rate, where $T$ is the number of iterations and $\sigma^2$ is the variance in the stochastic gradients. To ensure convergence to the minimizer, we propose a multi-stage approach that results in a noise-adaptive $O\left(\exp\left(-\frac{T}{\sqrt{\kappa}} \right) + \frac{\sigma}{T}\right)$ rate. For general strongly-convex functions, we use the averaging interpretation of SHB along with exponential step-sizes to prove an $O\left(\exp\left(-\frac{T}{\kappa} \right) + \frac{\sigma^2}{T} \right)$ convergence to the minimizer in a noise-adaptive manner. Finally, we empirically demonstrate the effectiveness of the proposed algorithms.
Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise
Pan, Rui, Liu, Yuxing, Wang, Xiaoyu, Zhang, Tong
Heavy-ball momentum with decaying learning rates is widely used with SGD for optimizing deep learning models. In contrast to its empirical popularity, the understanding of its theoretical property is still quite limited, especially under the standard anisotropic gradient noise condition for quadratic regression problems. Although it is widely conjectured that heavy-ball momentum method can provide accelerated convergence and should work well in large batch settings, there is no rigorous theoretical analysis. In this paper, we fill this theoretical gap by establishing a non-asymptotic convergence bound for stochastic heavy-ball methods with step decay scheduler on quadratic objectives, under the anisotropic gradient noise condition. As a direct implication, we show that heavy-ball momentum can provide $\tilde{\mathcal{O}}(\sqrt{\kappa})$ accelerated convergence of the bias term of SGD while still achieving near-optimal convergence rate with respect to the stochastic variance term. The combined effect implies an overall convergence rate within log factors from the statistical minimax rate. This means SGD with heavy-ball momentum is useful in the large-batch settings such as distributed machine learning or federated learning, where a smaller number of iterations can significantly reduce the number of communication rounds, leading to acceleration in practice.
- Europe > Russia (0.04)
- Asia > Russia (0.04)
- Asia > Middle East > Jordan (0.04)
- (3 more...)
Almost Sure Saddle Avoidance of Stochastic Gradient Methods without the Bounded Gradient Assumption
We prove that various stochastic gradient descent methods, including the stochastic gradient descent (SGD), stochastic heavy-ball (SHB), and stochastic Nesterov's accelerated gradient (SNAG) methods, almost surely avoid any strict saddle manifold. To the best of our knowledge, this is the first time such results are obtained for SHB and SNAG methods. Moreover, our analysis expands upon previous studies on SGD by removing the need for bounded gradients of the objective function and uniformly bounded noise. Instead, we introduce a more practical local boundedness assumption for the noisy gradient, which is naturally satisfied in empirical risk minimization problems typically seen in training of neural networks. Keywords: Stochastic gradient descent, stochastic heavy-ball, stochastic Nesterov's accelerated gradient, almost sure saddle avoidance
- Asia > Middle East > Jordan (0.05)
- North America > Canada > Ontario > Waterloo Region > Waterloo (0.04)
- Asia > China > Hubei Province > Wuhan (0.04)
Mean-field analysis for heavy ball methods: Dropout-stability, connectivity, and global convergence
Wu, Diyuan, Kungurtsev, Vyacheslav, Mondelli, Marco
The stochastic heavy ball method (SHB), also known as stochastic gradient descent (SGD) with Polyak's momentum, is widely used in training neural networks. However, despite the remarkable success of such algorithm in practice, its theoretical characterization remains limited. In this paper, we focus on neural networks with two and three layers and provide a rigorous understanding of the properties of the solutions found by SHB: \emph{(i)} stability after dropping out part of the neurons, \emph{(ii)} connectivity along a low-loss path, and \emph{(iii)} convergence to the global optimum. To achieve this goal, we take a mean-field view and relate the SHB dynamics to a certain partial differential equation in the limit of large network widths. This mean-field perspective has inspired a recent line of work focusing on SGD while, in contrast, our paper considers an algorithm with momentum. More specifically, after proving existence and uniqueness of the limit differential equations, we show convergence to the global optimum and give a quantitative bound between the mean-field limit and the SHB dynamics of a finite-width network. Armed with this last bound, we are able to establish the dropout-stability and connectivity of SHB solutions.
- North America > United States (0.14)
- Asia > Middle East > Jordan (0.04)
- Europe > Russia (0.04)
- (2 more...)
On the convergence of the Stochastic Heavy Ball Method
Sebbouh, Othmane, Gower, Robert M., Defazio, Aaron
We provide a comprehensive analysis of the Stochastic Heavy Ball (SHB) method (otherwise known as the momentum method), including a convergence of the last iterate of SHB, establishing a faster rate of convergence than existing bounds on the last iterate of Stochastic Gradient Descent (SGD) in the convex setting. Our analysis shows that unlike SGD, no final iterate averaging is necessary with the SHB method. We detail new iteration dependent step sizes (learning rates) and momentum parameters for the SHB that result in this fast convergence. Moreover, assuming only smoothness and convexity, we prove that the iterates of SHB converge \textit{almost surely} to a minimizer, and that the convergence of the function values of (S)HB is asymptotically faster than that of (S)GD in the overparametrized and in the deterministic settings. Our analysis is general, in that it includes all forms of mini-batching and non-uniform samplings as a special case, using an arbitrary sampling framework. Furthermore, our analysis does not rely on the bounded gradient assumptions. Instead, it only relies on smoothness, which is an assumption that can be more readily verified. Finally, we present extensive numerical experiments that show that our theoretically motivated parameter settings give a statistically significant faster convergence across a diverse collection of datasets.
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Russia (0.04)
- Asia > Russia (0.04)
- Asia > Japan > Kyūshū & Okinawa > Okinawa (0.04)
Rapidly Adapting Moment Estimation
Zhang, Guoqiang, Niwa, Kenta, Kleijn, W. Bastiaan
Adaptive gradient methods such as Adam have been shown to be very effective for training deep neural networks (DNNs) by tracking the second moment of gradients to compute the individual learning rates. Differently from existing methods, we make use of the most recent first moment of gradients to compute the individual learning rates per iteration. The motivation behind it is that the dynamic variation of the first moment of gradients may provide useful information to obtain the learning rates. We refer to the new method as the rapidly adapting moment estimation (RAME). The theoretical convergence of deterministic RAME is studied by using an analysis similar to the one used in [1] for Adam. Experimental results for training a number of DNNs show promising performance of RAME w.r.t. the convergence speed and generalization performance compared to the stochastic heavy-ball (SHB) method, Adam, and RMSprop.
- Asia > Japan (0.04)
- Oceania > New Zealand > North Island > Wellington Region > Wellington (0.04)
- Oceania > Australia > New South Wales > Sydney (0.04)
- (3 more...)
A Unified Analysis of Stochastic Momentum Methods for Deep Learning
Yan, Yan, Yang, Tianbao, Li, Zhe, Lin, Qihang, Yang, Yi
Stochastic momentum methods have been widely adopted in training deep neural networks. However, their theoretical analysis of convergence of the training objective and the generalization error for prediction is still under-explored. This paper aims to bridge the gap between practice and theory by analyzing the stochastic gradient (SG) method, and the stochastic momentum methods including two famous variants, i.e., the stochastic heavy-ball (SHB) method and the stochastic variant of Nesterov's accelerated gradient (SNAG) method. We propose a framework that unifies the three variants. We then derive the convergence rates of the norm of gradient for the non-convex optimization problem, and analyze the generalization performance through the uniform stability approach. Particularly, the convergence analysis of the training objective exhibits that SHB and SNAG have no advantage over SG. However, the stability analysis shows that the momentum term can improve the stability of the learned model and hence improve the generalization performance. These theoretical insights verify the common wisdom and are also corroborated by our empirical analysis on deep learning.
- North America > United States > Iowa (0.04)
- Europe > Russia (0.04)
- Asia > Russia (0.04)
- Research Report (0.82)
- Instructional Material (0.54)