momentum method
A geometric framework for momentum-based optimizers for low-rank training
Low-rank pre-training and fine-tuning have recently emerged as promising techniques for reducing the computational and storage costs of large neural networks. Training low-rank parameterizations typically relies on conventional optimizers such as heavy ball momentum methods or Adam. In this work, we identify and analyze potential difficulties that these training methods encounter when used to train low-rank parameterizations of weights. In particular, we show that classical momentum methods can struggle to converge to a local optimum due to the geometry of the underlying optimization landscape. To address this, we introduce novel training strategies derived from dynamical low-rank approximation, which explicitly account for the underlying geometric structure. Our approach leverages and combines tools from dynamical low-rank approximation and momentum-based optimization to design optimizers that respect the intrinsic geometry of the parameter space. We validate our methods through numerical experiments, demonstrating faster convergence, and stronger validation metrics at given parameter budgets.
In Search of Adam's Secret Sauce
Understanding the remarkable efficacy of Adam when training transformer-based language models has become a central research topic within the optimization community. To gain deeper insights, several simplifications of Adam have been proposed, such as the signed gradient and signed momentum methods. In this work, we conduct an extensive empirical study -- training over 1,500 language models across different data configurations and scales -- comparing Adam to several known simplified variants. We find that signed momentum methods are faster than SGD, but consistently underperform relative to Adam, even after careful tuning of momentum, clipping setting and learning rates. However, our analysis reveals a compelling option that preserves near-optimal performance while allowing for new insightful reformulations: constraining the Adam momentum parameters to be equal, $\beta_1=\beta_2$. Beyond robust performance, this choice affords new theoretical insights, highlights the secret sauce on top of signed momentum, and grants a precise statistical interpretation: we show that Adam in this setting implements a natural online algorithm for estimating the mean and variance of gradients--one that arises from a mean-field Gaussian variational inference perspective.
Gradient Descent Algorithm Survey
Fucheng, Deng, Wanjie, Wang, Ao, Gong, Xiaoqi, Wang, Fan, Wang
Its simple update, linear scalability with sample size, and compatibility with momentum, mini-batching, and learning-rate heuristics keep it dominant in both industry and academia. Current research continues to refine convergence rates, variance characterizations, and averaging schemes, while engineering efforts focus on hardware-aligned and distributed variants. B. Mini-Batch Stochastic Gradient Descent 1) Background and Development: Batch Gradient Descent (BGD) requires computing the gradient using the entire training dataset at each iteration. As dataset sizes expand to millions or even larger scales, the computational cost of a single iteration becomes extremely high, making it unsuitable for large-scale learning tasks. The convergence of SGD was proven by Robbins and Monro through the stochastic approximation method [1]. SGD uses one sample to update the gradient at each step, resulting in low computational cost but high gradient variance and unstable updates. The mini-batch strategy has gradually become the mainstream in practice, especially with the rise of large-scale machine learning and deep learning. Bottou emphasized the practical value of mini-batches in his research on large-scale learning [5], while systematic monographs and reviews on deep learning have further standardized this approach [6], [7]. Mini-batch SGD achieves an optimal balance between stability, high-frequency updates, and GPU parallel acceleration [2].
Schedulers for Schedule-free: Theoretically inspired hyperparameters
Pun, Yuen-Man, Buchholz, Matthew, Gower, Robert M.
The recently proposed schedule-free method has been shown to achieve strong performance when hyperparameter tuning is limited. The current theory for schedule-free only supports a constant learning rate, where-as the implementation used in practice uses a warm-up schedule. We show how to extend the last-iterate convergence theory of schedule-free to allow for any scheduler, and how the averaging parameter has to be updated as a function of the learning rate. We then perform experiments showing how our convergence theory has some predictive power with regards to practical executions on deep neural networks, despite that this theory relies on assuming convexity. When applied to the warmup-stable-decay (wsd) schedule, our theory shows the optimal convergence rate of $\mathcal{O}(1/\sqrt{T})$. We then use convexity to design a new adaptive Polyak learning rate schedule for schedule-free. We prove an optimal anytime last-iterate convergence for our new Polyak schedule, and show that it performs well compared to a number of baselines on a black-box model distillation task.
Heavy-Ball Momentum Method in Continuous Time and Discretization Error Analysis
Lyu, Bochen, Zhang, Xiaojing, Zheng, Fangyi, Wang, He, Wang, Zheng, Zhu, Zhanxing
This paper establishes a continuous time approximation, a piece-wise continuous differential equation, for the discrete Heavy-Ball (HB) momentum method with explicit discretization error. Investigating continuous differential equations has been a promising approach for studying the discrete optimization methods. Despite the crucial role of momentum in gradient-based optimization methods, the gap between the original discrete dynamics and the continuous time approximations due to the discretization error has not been comprehensively bridged yet. In this work, we study the HB momentum method in continuous time while putting more focus on the discretization error to provide additional theoretical tools to this area. In particular, we design a first-order piece-wise continuous differential equation, where we add a number of counter terms to account for the discretization error explicitly. As a result, we provide a continuous time model for the HB momentum method that allows the control of discretization error to arbitrary order of the step size. As an application, we leverage it to find a new implicit regularization of the directional smoothness and investigate the implicit bias of HB for diagonal linear networks, indicating how our results can be used in deep learning. Our theoretical findings are further supported by numerical experiments.
Reviews: Understanding the Role of Momentum in Stochastic Gradient Methods
INDIVIDUAL COMMENTS / QUESTIONS 1) I really appreciate how the paper ties up loose ends by unifying the analysis of several momentum-based methods in the stochastic setting. I am not very closely familiar with the literature analyzing momentum methods, but there's a lot of work out there (e.g., the line of research studying momentum methods in the continuous time limit). A brief review would be very helpful to position the paper within the existing work. To me this implies that the analysis would go through for more general functions. I don't find it obvious that it would.
On the Performance Analysis of Momentum Method: A Frequency Domain Perspective
Li, Xianliang, Luo, Jun, Zheng, Zhiwei, Wang, Hanxiao, Luo, Li, Wen, Lingkun, Wu, Linlong, Xu, Sheng
Momentum-based optimizers are widely adopted for training neural networks. However, the optimal selection of momentum coefficients remains elusive. This uncertainty impedes a clear understanding of the role of momentum in stochastic gradient methods. In this paper, we present a frequency domain analysis framework that interprets the momentum method as a time-variant filter for gradients, where adjustments to momentum coefficients modify the filter characteristics. Our experiments support this perspective and provide a deeper understanding of the mechanism involved. Moreover, our analysis reveals the following significant findings: high-frequency gradient components are undesired in the late stages of training; preserving the original gradient in the early stages, and gradually amplifying low-frequency gradient components during training both enhance generalization performance. Based on these insights, we propose Frequency Stochastic Gradient Descent with Momentum (FSGDM), a heuristic optimizer that dynamically adjusts the momentum filtering characteristic with an empirically effective dynamic magnitude response. Experimental results demonstrate the superiority of FSGDM over conventional momentum optimizers.
A Novel Momentum-Based Deep Learning Techniques for Medical Image Classification and Segmentation
Biswas, Koushik, Pal, Ridal, Patel, Shaswat, Jha, Debesh, Karri, Meghana, Reza, Amit, Durak, Gorkem, Medetalibeyoglu, Alpay, Antalek, Matthew, Velichko, Yury, Ladner, Daniela, Borhani, Amir, Bagci, Ulas
Accurately segmenting different organs from medical images is a critical prerequisite for computer-assisted diagnosis and intervention planning. This study proposes a deep learning-based approach for segmenting various organs from CT and MRI scans and classifying diseases. Our study introduces a novel technique integrating momentum within residual blocks for enhanced training dynamics in medical image analysis. We applied our method in two distinct tasks: segmenting liver, lung, & colon data and classifying abdominal pelvic CT and MRI scans. The proposed approach has shown promising results, outperforming state-of-the-art methods on publicly available benchmarking datasets. For instance, in the lung segmentation dataset, our approach yielded significant enhancements over the TransNetR model, including a 5.72% increase in dice score, a 5.04% improvement in mean Intersection over Union (mIoU), an 8.02% improvement in recall, and a 4.42% improvement in precision. Hence, incorporating momentum led to state-of-the-art performance in both segmentation and classification tasks, representing a significant advancement in the field of medical imaging.