AITopics

Manikandan, Meenakshi, Gilpin, Leilani

Explore the Loss space with Hill-ADAM

arXiv.org Machine Learning, Oct-7-2025

This paper introduces Hill-ADAM, an optimizer focused on escaping local minima in prescribed loss landscapes to find the global minimum. Hill-ADAM escapes minima by deterministically exploring the state space. This eliminates the uncertainty of random gradient updates in stochastic algorithms while seldom converging at the first minimum it visits. In the paper we first derive an analytical approximation of the ADAM optimizer step size at a particular model state. From there we define the primary condition determining ADAM's limitations in escaping local minima. The proposed optimizer algorithm, Hill-ADAM, alternates between error minimization and maximization: it maximizes to escape the local minimum and minimizes again afterward. This alternation provides an overall exploration of the loss space, which allows the global minimum's state to be deduced. Hill-ADAM was tested with 5 loss functions and 12 amber-saturated to cooler-shade image color correction instances.
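
The alternation between minimization and maximization can be illustrated with a toy sketch. This is not the authors' algorithm: the fixed phase length, the 1-D double-well loss, and the names `loss`, `grad`, and `alternating_adam` are all assumptions made for illustration. The sketch runs standard Adam updates, flips the sign of the gradient every `phase_len` steps to switch between descent and ascent, and records the lowest loss seen along the way.

```python
import math

def loss(x):
    # Hypothetical double-well loss with one local and one global minimum.
    return x**4 - 3 * x**2 + x

def grad(x):
    # Analytical derivative of the toy loss above.
    return 4 * x**3 - 6 * x + 1

def alternating_adam(x0, steps=1000, phase_len=200, lr=0.05,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    x, m, v = x0, 0.0, 0.0
    best_x, best_loss = x0, loss(x0)
    for t in range(1, steps + 1):
        # Assumption: fixed-length phases. Even phases minimize, odd phases maximize.
        minimizing = ((t - 1) // phase_len) % 2 == 0
        g = grad(x) if minimizing else -grad(x)
        # Standard Adam moment estimates with bias correction.
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1**t)
        v_hat = v / (1 - beta2**t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
        # Keep the best state visited in any phase.
        current = loss(x)
        if current < best_loss:
            best_x, best_loss = x, current
    return best_x, best_loss

best_x, best_loss = alternating_adam(x0=2.0)
```

Because the best state is tracked across all phases, the result is never worse than what the first descent phase alone reaches. Whether the ascent phases carry the iterate into a better basin depends on the loss and on the phase schedule, which is chosen arbitrarily here.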

gradient, hill-adam, loss function, (16 more...)


2510.03613

Country:

North America > United States > California > San Diego County > San Diego (0.04)
North America > Canada > Ontario > Toronto (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.67)

Pasand, Ali Saheb, Dohmatob, Elvis

Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking

Grokking is the phenomenon whereby, unlike the training performance, which peaks early in the training process, the test/generalization performance of a model stagnates over arbitrarily many epochs and then suddenly jumps to usually close to perfect levels. In practice, it is desirable to reduce the length of such plateaus, that is, to make the learning process "grok" faster. In this work, we provide new insights into grokking. First, we show both empirically and theoretically that grokking can be induced by asymmetric speeds of (stochastic) gradient descent along different principal (i.e., singular) directions of the gradients. We then propose a simple modification that normalizes the gradients so that the dynamics along all the principal directions evolve at exactly the same speed. Then, we establish that this modified method, which we call egalitarian gradient descent (EGD) and which can be seen as a carefully modified form of natural gradient descent, groks much faster. In fact, in some cases the stagnation is completely removed. Finally, we empirically show that on classical arithmetic problems such as modular addition and sparse parity, where this stagnation has been widely observed and intensively studied, our proposed method eliminates the plateaus.
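
One way to read "normalizing the gradients so that all principal directions evolve at the same speed" is to equalize the singular values of a matrix-shaped gradient. The sketch below does that with NumPy. The function name `egalitarian_direction` and the choice to set every singular value to exactly 1 are assumptions for illustration, not the paper's stated update rule.

```python
import numpy as np

def egalitarian_direction(grad):
    # Thin SVD: grad = u @ diag(s) @ vt.
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    # Assumption: equalize speeds by setting every singular value to 1
    # (this sketch assumes the gradient has full rank).
    return u @ vt

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 3))            # hypothetical gradient matrix
D = egalitarian_direction(G)

# After normalization, all singular values of the update direction are equal.
singular_values = np.linalg.svd(D, compute_uv=False)
```

With plain gradient descent, the component along a direction with a small singular value shrinks slowly, which is the asymmetry the abstract describes. Here every principal direction receives the same step magnitude.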

artificial intelligence, arxivpreprintarxiv, machine learning, (18 more...)

2510.0493

Genre: Research Report (0.42)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.97)

Deng, Jiaxin, Pang, Junbiao

Adaptively Sampling-Reusing-Mixing Decomposed Gradients to Speed Up Sharpness Aware Minimization

Sharpness-Aware Minimization (SAM) improves model generalization but doubles the computational cost of Stochastic Gradient Descent (SGD) by requiring twice the gradient calculations per optimization step. To mitigate this, we propose Adaptively sampling-Reusing-mixing decomposed gradients to significantly accelerate SAM (ARSAM). Concretely, we first discover that SAM's gradient can be decomposed into the SGD gradient and the Projection of the Second-order gradient onto the First-order gradient (PSF). Furthermore, we observe that the SGD gradient and the PSF evolve dynamically during training, emphasizing the growing role of the PSF in reaching flat minima. Therefore, ARSAM is proposed to reuse the PSF and update it in a timely manner while still maintaining the model's generalization ability. Extensive experiments show that ARSAM achieves state-of-the-art accuracies comparable to SAM across diverse network architectures. On CIFAR-10/100, ARSAM is comparable to SAM while providing a speedup of about 40%. Moreover, ARSAM accelerates optimization for various challenging tasks (e.g., human pose estimation and model quantization) without sacrificing performance, demonstrating its broad practicality. The code is publicly accessible at: https://github.com/ajiaaa/ARSAM.
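
The doubled cost comes from the two gradient evaluations in a standard SAM step, which the following sketch shows on a toy quadratic. The loss, the names `grad` and `sam_step`, and the hyperparameters are assumptions for illustration. The `extra` term is simply the difference between the SAM gradient and the plain gradient. It stands in for the second component of the decomposition and is not claimed to be the paper's exact PSF formula.

```python
import math

def grad(w):
    # Gradient of the hypothetical toy loss L(w) = 0.5 * (3*w0^2 + w1^2).
    return [3.0 * w[0], 1.0 * w[1]]

def sam_step(w, lr=0.1, rho=0.05):
    g = grad(w)                                      # first gradient evaluation
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0
    # Move to the perturbed point w + rho * g / ||g||.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_sam = grad(w_adv)                              # second gradient evaluation
    # Decompose the SAM gradient as g_sam = g + extra.
    extra = [gs - gi for gs, gi in zip(g_sam, g)]
    # Update the original weights with the gradient taken at the perturbed point.
    w_new = [wi - lr * gs for wi, gs in zip(w, g_sam)]
    return w_new, g, extra

w = [1.0, 1.0]
w_new, g, extra = sam_step(w)
```

A method that reuses `extra` across several steps would skip the second call to `grad` on those steps, which is where a speedup of the kind described could come from.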

artificial intelligence, deep learning, machine learning, (16 more...)

2510.03763

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Vision (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Liu, Anji, Shao, Zilei, Broeck, Guy Van den

Rethinking Probabilistic Circuit Parameter Learning

Probabilistic Circuits (PCs) offer a computationally scalable framework for generative modeling, supporting exact and efficient inference of a wide range of probabilistic queries. While recent advances have significantly improved the expressiveness and scalability of PCs, effectively training their parameters remains a challenge. In particular, a widely used optimization method, full-batch Expectation-Maximization (EM), requires processing the entire dataset before performing a single update, making it ineffective for large datasets. Although empirical extensions to the mini-batch setting, as well as gradient-based mini-batch algorithms, converge faster than full-batch EM, they generally underperform in terms of final likelihood. We investigate this gap by establishing a novel theoretical connection between these practical algorithms and the general EM objective. Our analysis reveals a fundamental issue: existing mini-batch EM and gradient-based methods fail to properly regularize distribution changes, causing each update to effectively "overfit" the current mini-batch. Motivated by this insight, we introduce anemone, a new mini-batch EM algorithm for PCs. Anemone applies an implicit adaptive learning rate to each parameter, scaled by how much it contributes to the likelihood of the current batch. Across extensive experiments on language, image, and DNA datasets, anemone consistently outperforms existing optimizers in both convergence speed and final performance.

artificial intelligence, machine learning, node, (15 more...)

2505.19982

Country:

North America > United States > California (0.28)
Europe (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.66)

Stochastic Approximation Methods for Distortion Risk Measure Optimization

Jiang, Jinyang, Heidergott, Bernd, Hu, Jiaqiao, Peng, Yijie

Distortion Risk Measures (DRMs) capture risk preferences in decision-making and serve as general criteria for managing uncertainty. This paper proposes gradient descent algorithms for DRM optimization based on two dual representations: the Distortion-Measure (DM) form and Quantile-Function (QF) form. The DM-form employs a three-timescale algorithm to track quantiles, compute their gradients, and update decision variables, utilizing the Generalized Likelihood Ratio and kernel-based density estimation. The QF-form provides a simpler two-timescale approach that avoids the need for complex quantile gradient estimation. A hybrid form integrates both approaches, applying the DM-form for robust performance around distortion function jumps and the QF-form for efficiency in smooth regions. Proofs of strong convergence and convergence rates for the proposed algorithms are provided. In particular, the DM-form achieves an optimal rate of $O(k^{-4/7})$, while the QF-form attains a faster rate of $O(k^{-2/3})$. Numerical experiments confirm their effectiveness and demonstrate substantial improvements over baselines in robust portfolio selection tasks. The method's scalability is further illustrated through integration into deep reinforcement learning. Specifically, a DRM-based Proximal Policy Optimization algorithm is developed and applied to multi-echelon dynamic inventory management, showcasing its practical applicability.
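
For readers unfamiliar with the objective being optimized, a distortion risk measure can be estimated from samples by weighting order statistics with increments of the distortion function applied to the empirical survival probabilities. The sketch below shows only this plug-in estimator, not the paper's stochastic approximation algorithms. The function names and the loss convention (larger values are worse) are assumptions for illustration.

```python
def empirical_drm(samples, distortion):
    # Plug-in estimate: sum over order statistics x_(1) <= ... <= x_(n),
    # each weighted by g((n-i+1)/n) - g((n-i)/n) for distortion function g.
    xs = sorted(samples)
    n = len(xs)
    total = 0.0
    for i, x in enumerate(xs, start=1):
        weight = distortion((n - i + 1) / n) - distortion((n - i) / n)
        total += weight * x
    return total

def identity(u):
    # g(u) = u recovers the ordinary expectation.
    return u

def cvar_distortion(alpha):
    # g(u) = min(u / (1 - alpha), 1) gives the conditional value-at-risk at level alpha.
    return lambda u: min(u / (1.0 - alpha), 1.0)

data = [float(k) for k in range(1, 11)]               # hypothetical losses 1..10
mean_value = empirical_drm(data, identity)             # sample mean
cvar_value = empirical_drm(data, cvar_distortion(0.8)) # average of the two largest losses
```

The CVaR distortion has a kink at u = 1 - alpha. Distortion functions with jumps instead of kinks are the harder case that the abstract's hybrid form is meant to handle.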

artificial intelligence, machine learning, reinforcement learning, (18 more...)

2510.04563

Country: North America > United States > New York (0.28)

Genre: Research Report (1.00)

Industry: Information Technology (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.47)

Categorical Invariants of Learning Dynamics

Tamim, Abdulrahman

Neural network training is typically viewed as gradient descent on a loss surface. We propose a fundamentally different perspective: learning is a structure-preserving transformation (a functor L) between the space of network parameters (Param) and the space of learned representations (Rep). This categorical framework reveals that different training runs producing similar test performance often belong to the same homotopy class (continuous deformation family) of optimization paths. We show experimentally that networks converging via homotopic trajectories generalize within 0.5% accuracy of each other, while non-homotopic paths differ by over 3%. The theory provides practical tools: persistent homology identifies stable minima predictive of generalization (R^2 = 0.82 correlation), pullback constructions formalize transfer learning, and 2-categorical structures explain when different optimization algorithms yield functionally equivalent models. These categorical invariants offer both theoretical insight into why deep learning works and concrete algorithmic principles for training more robust networks.

artificial intelligence, machine learning, representation, (19 more...)

2510.04376

Genre: Research Report (0.40)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)

REG: A Regularization Optimizer for Robust Training Dynamics

Liu, Zehua, Wu, Han, Fu, Xiaojin, Liu, Shuqi, Han, Xiongwei, Zhong, Tao, Yuan, Mingxuan

Optimizers are crucial for the efficient training of Large Language Models (LLMs). While AdamW is the de facto standard, recent structure-aware optimizers like Muon have emerged, which regularize gradient updates by operating on entire weight matrices. The Muon optimizer balances the gradient updates along all the directions. However, Muon's reliance on the matrix sign function can lead to training instability, and it exhibits incompatibility when fine-tuning models pre-trained with AdamW. To address these limitations, we propose REG, a novel optimizer that replaces Muon's aggressive matrix sign operator with the Row-and-Column-Scaling (RACS) operator. Theoretically grounded in balancing a matrix, the RACS operator regularizes the update steps in a less drastic manner, making it simpler to implement and more compatible with established training dynamics. Through extensive empirical experiments on LLM training, we demonstrate that our REG optimizer not only achieves superior performance and stability over AdamW, but also maintains consistency with the AdamW training paradigm. This consistency is particularly evident during the fine-tuning stage, where the REG optimizer avoids the performance degradation observed with Muon.

large language model, machine learning, natural language, (21 more...)

2510.03691

Country: Asia (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

Patitucci, Francisco, Jiang, Ruichen, Mokhtari, Aryan

Improving Online-to-Nonconvex Conversion for Smooth Optimization via Double Optimism

A recent breakthrough in nonconvex optimization is the online-to-nonconvex conversion framework of [Cutkosky et al., 2023], which reformulates the task of finding an $\varepsilon$-first-order stationary point as an online learning problem. When both the gradient and the Hessian are Lipschitz continuous, instantiating this framework with two different online learners achieves a complexity of $O(\varepsilon^{-1.75}\log(1/\varepsilon))$ in the deterministic case and a complexity of $O(\varepsilon^{-3.5})$ in the stochastic case. However, this approach suffers from several limitations: (i) the deterministic method relies on a complex double-loop scheme that solves a fixed-point equation to construct hint vectors for an optimistic online learner, introducing an extra logarithmic factor; (ii) the stochastic method assumes a bounded second-order moment of the stochastic gradient, which is stronger than standard variance bounds; and (iii) different online learning algorithms are used in the two settings. In this paper, we address these issues by introducing an online optimistic gradient method based on a novel doubly optimistic hint function. Specifically, we use the gradient at an extrapolated point as the hint, motivated by two optimistic assumptions: that the difference between the hint and the target gradient remains near constant, and that consecutive update directions change slowly due to smoothness. Our method eliminates the need for a double loop and removes the logarithmic factor. Furthermore, by simply replacing full gradients with stochastic gradients and under the standard assumption that their variance is bounded by $σ^2$, we obtain a unified algorithm with complexity $O(\varepsilon^{-1.75} + σ^2 \varepsilon^{-3.5})$, smoothly interpolating between the best-known deterministic rate and the optimal stochastic rate.

artificial intelligence, gradient, machine learning, (17 more...)

2510.03167

Country: North America > United States > Texas (0.28)

Genre: Research Report (0.63)

Industry: Education (0.76)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.69)

Liang, Shuang, Montúfar, Guido

Gradient Descent with Large Step Sizes: Chaos and Fractal Convergence Region

arXiv.org Machine Learning, Oct-6-2025

We examine gradient descent in matrix factorization and show that under large step sizes the parameter space develops a fractal structure. We derive the exact critical step size for convergence in scalar-vector factorization and show that near criticality the selected minimizer depends sensitively on the initialization. Moreover, we show that adding regularization amplifies this sensitivity, generating a fractal boundary between initializations that converge and those that diverge. The analysis extends to general matrix factorization with orthogonal initialization. Our findings reveal that near-critical step sizes induce a chaotic regime of gradient descent where the long-term dynamics are unpredictable and there are no simple implicit biases, such as towards balancedness, minimum norm, or flatness.
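
The sensitivity to step size can be explored on a simplified scalar-scalar factorization, a smaller problem than the scalar-vector case analyzed in the paper. The sketch below runs gradient descent on the toy loss 0.5 * (u*v - y)^2 and reports whether it converged. The function name `run_gd`, the initialization, and the step sizes swept are assumptions for illustration. No critical step size is claimed here.

```python
import math

def run_gd(u, v, y=1.0, step=0.1, iters=2000, tol=1e-8):
    for _ in range(iters):
        r = u * v - y                               # residual of 0.5 * (u*v - y)^2
        # Simultaneous gradient step on both factors.
        u, v = u - step * r * v, v - step * r * u
        if not (math.isfinite(u) and math.isfinite(v)):
            return u, v, False                      # iterates blew up
    return u, v, abs(u * v - y) < tol

# A small step size converges from this initialization.
u_small, v_small, ok_small = run_gd(1.5, 0.5, step=0.1)

# Sweeping the step size shows how the outcome can change; the outcomes for
# the larger steps are left to be inspected rather than asserted.
outcomes = {s: run_gd(1.5, 0.5, step=s)[2] for s in (0.1, 0.5, 1.0, 2.0)}
```

Repeating the sweep over a grid of initializations (u, v) and coloring each point by its outcome is one way to visualize the boundary between converging and diverging regions that the abstract describes.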

converge, gradient descent, initialization, (11 more...)


2509.25351

Country:

North America > United States > Indiana (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.87)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)