AITopics | momentum method



A geometric framework for momentum-based optimizers for low-rank training

Neural Information Processing Systems | Jun-13-2026, 01:26:58 GMT

Low-rank pre-training and fine-tuning have recently emerged as promising techniques for reducing the computational and storage costs of large neural networks. Training low-rank parameterizations typically relies on conventional optimizers such as heavy ball momentum methods or Adam. In this work, we identify and analyze potential difficulties that these training methods encounter when used to train low-rank parameterizations of weights. In particular, we show that classical momentum methods can struggle to converge to a local optimum due to the geometry of the underlying optimization landscape. To address this, we introduce novel training strategies derived from dynamical low-rank approximation, which explicitly account for the underlying geometric structure. Our approach combines tools from dynamical low-rank approximation and momentum-based optimization to design optimizers that respect the intrinsic geometry of the parameter space. We validate our methods through numerical experiments, demonstrating faster convergence and stronger validation metrics at given parameter budgets.
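To make the storage argument concrete, here is a minimal sketch of the parameter count for a rank-r factorization W ≈ U Vᵀ of an m × n weight matrix. The function names and the example layer size are illustrative assumptions and do not come from the paper, which is concerned with how to train such factors rather than how to count them.

```python
def dense_params(m: int, n: int) -> int:
    """Number of entries in a full m x n weight matrix."""
    return m * n


def low_rank_params(m: int, n: int, r: int) -> int:
    """Number of entries in the factors U (m x r) and V (n x r) of W ~ U V^T."""
    return r * (m + n)


if __name__ == "__main__":
    m, n, r = 1024, 1024, 16  # hypothetical layer size and rank
    print(dense_params(m, n), low_rank_params(m, n, r))
```

The factorized form stores fewer entries whenever r < m·n / (m + n), which is why a fixed parameter budget translates directly into a maximum rank.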

artificial intelligence, machine learning, proceedings, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.61)

In Search of Adam's Secret Sauce

Neural Information Processing Systems | Jun-12-2026, 09:02:25 GMT

Understanding the remarkable efficacy of Adam when training transformer-based language models has become a central research topic within the optimization community. To gain deeper insights, several simplifications of Adam have been proposed, such as the signed gradient and signed momentum methods. In this work, we conduct an extensive empirical study -- training over 1,500 language models across different data configurations and scales -- comparing Adam to several known simplified variants. We find that signed momentum methods are faster than SGD, but consistently underperform relative to Adam, even after careful tuning of momentum, clipping setting and learning rates. However, our analysis reveals a compelling option that preserves near-optimal performance while allowing for new insightful reformulations: constraining the Adam momentum parameters to be equal, $\beta_1=\beta_2$. Beyond robust performance, this choice affords new theoretical insights, highlights the secret sauce on top of signed momentum, and grants a precise statistical interpretation: we show that Adam in this setting implements a natural online algorithm for estimating the mean and variance of gradients--one that arises from a mean-field Gaussian variational inference perspective.
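One way to read the equal-betas setting: when both moving averages share the same decay, they weight past gradients identically, so the difference between the second-moment average and the squared first-moment average behaves like a running variance estimate. The sketch below illustrates that reading only. It is not the paper's derivation, and the function name, default `beta`, and return format are assumptions.

```python
import math


def adam_equal_betas(grads, beta=0.95, eps=1e-8):
    """Track Adam's two moving averages with beta1 == beta2 == beta.

    Returns a list of (m_hat, var_hat, update) tuples, one per gradient, where
    var_hat = v_hat - m_hat**2 is read as a running variance estimate.
    """
    m, v = 0.0, 0.0
    history = []
    for t, g in enumerate(grads, start=1):
        m = beta * m + (1.0 - beta) * g        # first-moment EMA
        v = beta * v + (1.0 - beta) * g * g    # second-moment EMA, same decay
        correction = 1.0 - beta ** t           # shared bias correction
        m_hat, v_hat = m / correction, v / correction
        var_hat = max(v_hat - m_hat * m_hat, 0.0)  # clamp float round-off
        update = m_hat / (math.sqrt(v_hat) + eps)  # learning rate omitted
        history.append((m_hat, var_hat, update))
    return history
```

With a constant gradient stream the variance estimate collapses to zero and the update magnitude approaches 1, while a noisy stream with a small mean yields a large variance estimate and a small update.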

artificial intelligence, machine learning, natural language, (10 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.84)

Gradient Descent Algorithm Survey

Fucheng, Deng, Wanjie, Wang, Ao, Gong, Xiaoqi, Wang, Fan, Wang

arXiv.org Artificial Intelligence | Nov-27-2025

Its simple update, linear scalability with sample size, and compatibility with momentum, mini-batching, and learning-rate heuristics keep it dominant in both industry and academia. Current research continues to refine convergence rates, variance characterizations, and averaging schemes, while engineering efforts focus on hardware-aligned and distributed variants.

B. Mini-Batch Stochastic Gradient Descent

1) Background and Development: Batch Gradient Descent (BGD) computes the gradient over the entire training dataset at each iteration. As datasets grow to millions of samples or more, the cost of a single iteration becomes extremely high, making BGD unsuitable for large-scale learning tasks. The convergence of SGD was proven by Robbins and Monro through the stochastic approximation method [1]. SGD estimates the gradient from a single sample at each step, resulting in low computational cost but high gradient variance and unstable updates. The mini-batch strategy has gradually become the mainstream in practice, especially with the rise of large-scale machine learning and deep learning. Bottou emphasized the practical value of mini-batches in his research on large-scale learning [5], while systematic monographs and reviews on deep learning have further standardized this approach [6], [7]. Mini-batch SGD balances update stability, update frequency, and GPU parallel acceleration [2].
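The trade-off between full-batch, single-sample, and mini-batch updates can be sketched on a one-parameter least-squares problem. This is a minimal illustration, not code from the survey: the function name, the toy model y ≈ w·x, and all default hyperparameters are assumptions.

```python
import random


def minibatch_sgd(xs, ys, lr=0.05, batch_size=2, epochs=50, seed=0):
    """Fit y ~ w * x by mini-batch SGD on the half squared error."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new sample order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of 0.5 * (w*x - y)^2, averaged over the batch.
            grad = sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.0, 4.0, 6.0, 8.0]  # noise-free data with true slope 2
    print(minibatch_sgd(xs, ys))
```

Setting `batch_size=1` recovers single-sample SGD, and `batch_size=len(xs)` recovers batch gradient descent, so the same loop covers all three regimes discussed above.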

algorithm, artificial intelligence, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2511.20725

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Schedulers for Schedule-free: Theoretically inspired hyperparameters

Pun, Yuen-Man, Buchholz, Matthew, Gower, Robert M.

arXiv.org Artificial Intelligence | Nov-12-2025

The recently proposed schedule-free method has been shown to achieve strong performance when hyperparameter tuning is limited. The current theory for schedule-free only supports a constant learning rate, whereas the implementation used in practice uses a warm-up schedule. We show how to extend the last-iterate convergence theory of schedule-free to allow for any scheduler, and how the averaging parameter has to be updated as a function of the learning rate. We then perform experiments showing that our convergence theory has some predictive power with regard to practical executions on deep neural networks, even though the theory assumes convexity. When applied to the warmup-stable-decay (wsd) schedule, our theory shows the optimal convergence rate of $\mathcal{O}(1/\sqrt{T})$. We then use convexity to design a new adaptive Polyak learning rate schedule for schedule-free. We prove an optimal anytime last-iterate convergence for our new Polyak schedule, and show that it performs well compared to a number of baselines on a black-box model distillation task.
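For readers unfamiliar with the warmup-stable-decay (wsd) schedule named above, the following sketch shows one common shape: a linear warm-up, a constant plateau, and a linear decay. The abstract does not specify the shapes, so the linear ramps, the function name, and every parameter name here are assumptions rather than the paper's definition.

```python
def wsd_learning_rate(step, total_steps, peak_lr, warmup_steps, decay_steps):
    """Warmup-stable-decay schedule: linear warm-up, constant plateau, linear decay.

    The linear shapes are an assumption; other decay shapes are also used.
    `step` is zero-based and must satisfy 0 <= step < total_steps.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # ramp up to peak_lr
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr                               # stable phase
    remaining = total_steps - step
    return peak_lr * remaining / decay_steps         # ramp down towards zero


if __name__ == "__main__":
    schedule = [wsd_learning_rate(t, 100, 1.0, 10, 20) for t in range(100)]
    print(schedule[0], schedule[50], schedule[90])
```

A schedule of this kind would be passed step by step to the optimizer; how the schedule-free averaging parameter should follow it is the subject of the paper and is not reproduced here.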

artificial intelligence, convergence, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2511.07767

Genre: Research Report (0.40)

Industry: Education (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Heavy-Ball Momentum Method in Continuous Time and Discretization Error Analysis

Lyu, Bochen, Zhang, Xiaojing, Zheng, Fangyi, Wang, He, Wang, Zheng, Zhu, Zhanxing

arXiv.org Artificial Intelligence | Oct-23-2025

This paper establishes a continuous time approximation, a piece-wise continuous differential equation, for the discrete Heavy-Ball (HB) momentum method with explicit discretization error. Investigating continuous differential equations has been a promising approach for studying discrete optimization methods. Despite the crucial role of momentum in gradient-based optimization methods, the gap between the original discrete dynamics and the continuous time approximations due to the discretization error has not yet been comprehensively bridged. In this work, we study the HB momentum method in continuous time while putting more focus on the discretization error to provide additional theoretical tools for this area. In particular, we design a first-order piece-wise continuous differential equation, where we add a number of counter terms to account for the discretization error explicitly. As a result, we provide a continuous time model for the HB momentum method that allows the control of discretization error to arbitrary order of the step size. As an application, we leverage it to find a new implicit regularization of the directional smoothness and investigate the implicit bias of HB for diagonal linear networks, indicating how our results can be used in deep learning. Our theoretical findings are further supported by numerical experiments.
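The discrete heavy-ball iteration that the continuous-time model approximates is commonly written as x_{k+1} = x_k - η∇f(x_k) + β(x_k - x_{k-1}). The sketch below implements that textbook form on a scalar problem. The function name, default step size, and momentum value are illustrative assumptions, and the paper's counter-term construction is not reproduced.

```python
def heavy_ball(grad, x0, lr=0.1, beta=0.9, steps=500):
    """Discrete heavy-ball iteration.

    x_{k+1} = x_k - lr * grad(x_k) + beta * (x_k - x_{k-1})
    """
    x_prev, x = x0, x0  # zero initial velocity
    for _ in range(steps):
        x_next = x - lr * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x


if __name__ == "__main__":
    # Minimize f(x) = 0.5 * x**2, whose gradient is x and whose minimizer is 0.
    print(heavy_ball(lambda x: x, x0=5.0))
```

On this quadratic the iterates oscillate while shrinking by a factor of roughly √β per step, which is the kind of discrete behavior a continuous-time model has to capture.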

artificial intelligence, machine learning, optimization problem, (20 more...)

arXiv.org Artificial Intelligence

2506.14806

Country: North America > United States (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry: Leisure & Entertainment > Sports > Tennis (0.60)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Reviews: Understanding the Role of Momentum in Stochastic Gradient Methods

Neural Information Processing Systems | Jan-23-2025, 14:13:17 GMT

INDIVIDUAL COMMENTS / QUESTIONS

1) I really appreciate how the paper ties up loose ends by unifying the analysis of several momentum-based methods in the stochastic setting. I am not very closely familiar with the literature analyzing momentum methods, but there's a lot of work out there (e.g., the line of research studying momentum methods in the continuous time limit). A brief review would be very helpful to position the paper within the existing work. To me this implies that the analysis would go through for more general functions. I don't find it obvious that it would.

literature review, momentum-based method, stochastic gradient method, (2 more...)

Neural Information Processing Systems

Genre: Overview (0.41)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)

On the Performance Analysis of Momentum Method: A Frequency Domain Perspective

Li, Xianliang, Luo, Jun, Zheng, Zhiwei, Wang, Hanxiao, Luo, Li, Wen, Lingkun, Wu, Linlong, Xu, Sheng

arXiv.org Artificial Intelligence | Nov-29-2024

Momentum-based optimizers are widely adopted for training neural networks. However, the optimal selection of momentum coefficients remains elusive. This uncertainty impedes a clear understanding of the role of momentum in stochastic gradient methods. In this paper, we present a frequency domain analysis framework that interprets the momentum method as a time-variant filter for gradients, where adjustments to momentum coefficients modify the filter characteristics. Our experiments support this perspective and provide a deeper understanding of the mechanism involved. Moreover, our analysis reveals the following significant findings: high-frequency gradient components are undesired in the late stages of training, while preserving the original gradient in the early stages and gradually amplifying low-frequency gradient components during training both enhance generalization performance. Based on these insights, we propose Frequency Stochastic Gradient Descent with Momentum (FSGDM), a heuristic optimizer that dynamically adjusts the momentum filtering characteristic with an empirically effective dynamic magnitude response. Experimental results demonstrate the superiority of FSGDM over conventional momentum optimizers.
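The filter interpretation can be illustrated with fixed coefficients: the recursion m_t = u·m_{t-1} + v·g_t is a first-order recursive filter whose frequency response is H(ω) = v / (1 - u·e^{-jω}). The sketch below evaluates its magnitude. It assumes constant coefficients `u` and `v` (so the filter is time-invariant, unlike the time-variant filter studied in the paper), and the function name is hypothetical.

```python
import cmath
import math


def momentum_gain(omega, u=0.9, v=0.1):
    """Magnitude response of m_t = u * m_{t-1} + v * g_t at angular frequency omega.

    The transfer function is H(omega) = v / (1 - u * exp(-1j * omega)).
    """
    return abs(v / (1.0 - u * cmath.exp(-1j * omega)))


if __name__ == "__main__":
    for omega in (0.0, math.pi / 2, math.pi):
        print(omega, momentum_gain(omega))
```

For 0 < u < 1 the gain is largest at ω = 0 and smallest at ω = π, so slowly varying gradient components pass through while rapidly alternating ones are attenuated. Changing `u` and `v` over the course of training reshapes this response.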

artificial intelligence, machine learning, momentum system, (18 more...)

arXiv.org Artificial Intelligence

2411.19671

Country:

North America > Canada > Ontario > Toronto (0.14)
Asia > China > Guangdong Province > Shenzhen (0.04)
North America > United States > Georgia > Fulton County > Atlanta (0.04)
(5 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.89)

A Novel Momentum-Based Deep Learning Techniques for Medical Image Classification and Segmentation

Biswas, Koushik, Pal, Ridal, Patel, Shaswat, Jha, Debesh, Karri, Meghana, Reza, Amit, Durak, Gorkem, Medetalibeyoglu, Alpay, Antalek, Matthew, Velichko, Yury, Ladner, Daniela, Borhani, Amir, Bagci, Ulas

arXiv.org Artificial Intelligence | Aug-11-2024

Accurately segmenting different organs from medical images is a critical prerequisite for computer-assisted diagnosis and intervention planning. This study proposes a deep learning-based approach for segmenting various organs from CT and MRI scans and classifying diseases. Our study introduces a novel technique integrating momentum within residual blocks for enhanced training dynamics in medical image analysis. We applied our method in two distinct tasks: segmenting liver, lung, & colon data and classifying abdominal pelvic CT and MRI scans. The proposed approach has shown promising results, outperforming state-of-the-art methods on publicly available benchmarking datasets. For instance, in the lung segmentation dataset, our approach yielded significant enhancements over the TransNetR model, including a 5.72% increase in dice score, a 5.04% improvement in mean Intersection over Union (mIoU), an 8.02% improvement in recall, and a 4.42% improvement in precision. Hence, incorporating momentum led to state-of-the-art performance in both segmentation and classification tasks, representing a significant advancement in the field of medical imaging.
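The metrics quoted above have standard definitions for binary masks, sketched below in terms of true positives, false positives, and false negatives. The function name and the convention of returning 1.0 for empty masks are assumptions, and the paper's exact evaluation protocol (for example, how mIoU is averaged over classes or images) is not given in the abstract.

```python
def segmentation_metrics(pred, target):
    """Dice, IoU, precision and recall for two flat binary masks (lists of 0/1)."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    # Assumed convention: an empty denominator counts as a perfect score.
    dice = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return {"dice": dice, "iou": iou, "precision": precision, "recall": recall}


if __name__ == "__main__":
    print(segmentation_metrics([1, 1, 0, 0], [1, 0, 1, 0]))
```

A mean IoU would average the `iou` value over classes or images, depending on the benchmark.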

classification, dataset, segmentation, (15 more...)

arXiv.org Artificial Intelligence

2408.05692

Country: