AITopics | Gradient Descent

Gradient Descent

Consolidating Reinforcement Learning for Multimodal Discrete Diffusion Models

Ma, Tianren, Zhang, Mu, Wang, Yibing, Ye, Qixiang

arXiv.org Artificial Intelligence | Oct-6-2025

Optimizing discrete diffusion models (DDMs) with rewards remains a challenge: the non-autoregressive paradigm makes importance sampling intractable and rollout complex, which confounds reinforcement learning methods such as Group Relative Policy Optimization (GRPO). In this study, we introduce MaskGRPO, the first viable approach to enable scalable multimodal reinforcement learning in discrete diffusion with effective importance sampling and modality-specific adaptations. To this end, we first clarify the theoretical foundation for DDMs, which facilitates building an importance estimator that captures valuable token fluctuation for gradient updates. We then carefully tailor the rollout method for visual sequences, which yields diverse completions and reliable optimization gradients. On math reasoning, coding, and visual generation benchmarks, MaskGRPO brings more stable and efficient updates, leading to stronger reasoning performance and better generation quality. This study establishes MaskGRPO as a systematic policy optimization approach and the first practical one for discretized visual diffusion.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

2510.0288

Country: North America (0.28)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Quantitative Convergence Analysis of Projected Stochastic Gradient Descent for Non-Convex Losses via the Goldstein Subdifferential

Zheng, Yuping, Lamperski, Andrew

arXiv.org Artificial Intelligence | Oct-6-2025

Stochastic gradient descent (SGD) is the main algorithm behind a large body of work in machine learning. In many cases, constraints are enforced via projections, leading to projected stochastic gradient algorithms. In recent years, a large body of work has examined the convergence properties of projected SGD for non-convex losses in asymptotic and non-asymptotic settings. Strong quantitative guarantees are available for convergence measured via Moreau envelopes. However, these results cannot be compared directly with work on unconstrained SGD, since the Moreau envelope construction changes the gradient. Other common measures based on gradient mappings have the limitation that convergence can only be guaranteed if variance reduction methods, such as mini-batching, are employed. This paper presents an analysis of projected SGD for non-convex losses over compact convex sets. Convergence is measured via the distance of the gradient to the Goldstein subdifferential generated by the constraints. Our proposed convergence criterion directly reduces to commonly used criteria in the unconstrained case, and we obtain convergence without requiring variance reduction. We obtain results for data that are independent, identically distributed (IID) or satisfy mixing conditions ($L$-mixing). In these cases, we derive asymptotic convergence and $O(N^{-1/3})$ non-asymptotic bounds in expectation, where $N$ is the number of steps. In the case of IID sub-Gaussian data, we obtain almost-sure asymptotic convergence and high-probability non-asymptotic $O(N^{-1/5})$ bounds. In particular, these are the first non-asymptotic high-probability bounds for projected SGD with non-convex losses.
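As a rough illustration of the projected stochastic gradient iteration the abstract refers to, here is a minimal sketch in plain Python. The loss, the Gaussian gradient noise, the step size, and the choice of a Euclidean ball as the compact convex set are all assumptions made for the example; they are not taken from the paper, and the sketch does not compute the Goldstein-subdifferential criterion.

```python
import math
import random


def project_onto_ball(x, radius=1.0):
    """Euclidean projection onto the compact convex set {x : ||x||_2 <= radius}."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    scale = radius / norm
    return [v * scale for v in x]


def noisy_gradient(x, rng, noise_std=0.1):
    """Stochastic gradient of the (assumed) non-convex loss
    f(x) = sum_i cos(3 * x_i) + x_i**2 / 2, with additive Gaussian noise."""
    return [-3.0 * math.sin(3.0 * v) + v + rng.gauss(0.0, noise_std) for v in x]


def projected_sgd(x0, steps=2000, step_size=0.01, radius=1.0, seed=0):
    """Take a noisy gradient step, then project back onto the ball, at every iteration."""
    rng = random.Random(seed)
    x = project_onto_ball(x0, radius)
    for _ in range(steps):
        g = noisy_gradient(x, rng)
        x = project_onto_ball([v - step_size * gi for v, gi in zip(x, g)], radius)
    return x


if __name__ == "__main__":
    print(projected_sgd([2.0, -3.0]))
```

Every iterate stays feasible because the projection is applied after each step, which is the property that distinguishes this scheme from unconstrained SGD.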

algorithm, artificial intelligence, machine learning, (15 more...)

2510.02735

Country: North America > United States > Minnesota (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.45)

a376802c0811f1b9088828288eb0d3f0-Paper.pdf

Neural Information Processing Systems | Oct-3-2025, 18:18:46 GMT

artificial intelligence, compression, machine learning, (18 more...)

Country: North America > United States (0.28)

Industry: Information Technology > Security & Privacy (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.97)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.30)

On the Global Convergence of (Fast) Incremental Expectation Maximization Methods

Belhal Karimi, Hoi-To Wai, Eric Moulines, Marc Lavielle

Neural Information Processing Systems | Oct-3-2025, 09:06:16 GMT

The EM algorithm is one of the most popular algorithms for inference in latent data models. The original formulation of the EM algorithm does not scale to large data sets, because the whole data set is required at each iteration of the algorithm. To alleviate this problem, Neal and Hinton [1998] have proposed an incremental version of the EM (iEM) in which at each iteration the conditional expectation of the latent data (E-step) is updated only for a mini-batch of observations. Another approach has been proposed by Cappé and Moulines [2009] in which the E-step is replaced by a stochastic approximation step, closely related to stochastic gradient. In this paper, we analyze incremental and stochastic versions of the EM algorithm as well as the variance-reduced version of [Chen et al., 2018] in a common unifying framework. We also introduce a new incremental version, inspired by the SAGA algorithm by Defazio et al. [2014]. We establish non-asymptotic convergence bounds for global convergence. Numerical applications are presented in this article to illustrate our findings.
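The incremental E-step idea attributed to Neal and Hinton above can be sketched as follows. This is only an illustration under assumptions that are not in the abstract: a one-dimensional Gaussian mixture with unit variances, a mini-batch of size one, and hypothetical function names. It is not the new SAGA-inspired variant the paper introduces.

```python
import math
import random


def responsibilities(x, weights, means):
    """E-step for a single observation under a unit-variance Gaussian mixture."""
    logs = [math.log(w) - 0.5 * (x - m) ** 2 for w, m in zip(weights, means)]
    top = max(logs)  # shift by the max for numerical stability
    exps = [math.exp(v - top) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]


def incremental_em(data, init_means, passes=20, seed=0):
    """Incremental EM: refresh the E-step for one sample, then redo the M-step."""
    rng = random.Random(seed)
    n, k = len(data), len(init_means)
    weights, means = [1.0 / k] * k, list(init_means)
    # One full E-step to initialise the stored per-sample statistics.
    resp = [responsibilities(x, weights, means) for x in data]
    s0 = [sum(r[j] for r in resp) for j in range(k)]
    s1 = [sum(r[j] * x for r, x in zip(resp, data)) for j in range(k)]
    for _ in range(passes * n):
        i = rng.randrange(n)
        new_r = responsibilities(data[i], weights, means)
        for j in range(k):
            # Swap the old contribution of sample i for the new one.
            s0[j] += new_r[j] - resp[i][j]
            s1[j] += (new_r[j] - resp[i][j]) * data[i]
        resp[i] = new_r
        # M-step from the aggregated sufficient statistics.
        weights = [s / n for s in s0]
        means = [a / b for a, b in zip(s1, s0)]
    return weights, means


if __name__ == "__main__":
    gen = random.Random(1)
    sample = [gen.gauss(-2.0, 1.0) for _ in range(200)]
    sample += [gen.gauss(3.0, 1.0) for _ in range(200)]
    print(incremental_em(sample, init_means=[-1.0, 1.0]))
```

Each iteration touches a single observation, so the per-iteration cost does not grow with the size of the data set; the price is the memory needed to store one set of responsibilities per observation.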

algorithm, convergence, fiem method, (14 more...)

Country:

Europe > France (0.04)
Asia > China > Hong Kong (0.04)
Oceania > New Zealand > North Island > Waikato (0.04)
(3 more...)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

37f0e884fbad9667e38940169d0a3c95-Reviews.html

Neural Information Processing Systems | Oct-3-2025, 08:58:05 GMT

The optimal first-order algorithm of Nesterov has linear convergence for such problems, but the constant depends on the square root of the condition number k. The authors consider the situation where one has access to the expensive full gradient of the objective as well as a cheap stochastic gradient oracle. They propose a hybrid algorithm which only requires O(log(1/eps)) calls to the full gradient oracle (independent of the condition number) and O(k^2 log(1/eps)) calls to the cheaper stochastic gradient oracle; as long as the condition number is not too big, this could be faster in theory. The main idea behind their algorithm (called Epoch Mixed Gradient Descent, EMGD) is to replace a full gradient step (called an epoch) with a fixed number O(k^2) of mixed gradient steps which use a combination of the full gradient (computed once for the epoch) and stochastic gradients (which vary within an epoch). By taking the average of the O(k^2) iterates within an epoch, they can show a constant decrease of the suboptimality *independent* of the condition number, which is why the number of required full gradient computations (the number of epochs) is independent of the condition number. They provide a simple, complete, and self-contained proof of their convergence rate, but no experiments.
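The epoch structure described in the review can be sketched in a few lines of Python. The review does not state the exact update rule, so the sketch assumes one plausible form of the mixed gradient: the full gradient at the epoch's starting point plus the difference between a stochastic gradient at the current iterate and at the starting point. The scalar least-squares objective, the step size, and the function name are illustrative assumptions, and any domain constraint used in the paper is omitted.

```python
import random


def emgd_sketch(a, b, epochs=20, inner_steps=50, step_size=0.5, seed=0):
    """Epoch-based mixed gradient steps on f(w) = mean_i 0.5 * (a_i * w - b_i)**2."""
    rng = random.Random(seed)
    n = len(a)

    def grad_i(w, i):
        # Stochastic gradient: gradient of the i-th term only.
        return a[i] * (a[i] * w - b[i])

    w_anchor = 0.0
    for _ in range(epochs):
        # One expensive full-gradient call per epoch.
        full_grad = sum(grad_i(w_anchor, i) for i in range(n)) / n
        w = w_anchor
        running_sum = 0.0
        for _ in range(inner_steps):
            i = rng.randrange(n)
            # Assumed mixed gradient: full gradient corrected by a cheap
            # stochastic difference between the iterate and the anchor.
            mixed = full_grad + grad_i(w, i) - grad_i(w_anchor, i)
            w -= step_size * mixed
            running_sum += w
        # The next epoch starts from the average of the inner iterates.
        w_anchor = running_sum / inner_steps
    return w_anchor


if __name__ == "__main__":
    features = [1.0, 1.2, 0.8, 1.1, 0.9]
    targets = [2.0 * v for v in features]  # exact minimiser is w = 2
    print(emgd_sketch(features, targets))
```

In this sketch the number of full-gradient calls equals `epochs`, while the number of stochastic-gradient calls is twice `epochs * inner_steps`, mirroring the two oracle counts the review distinguishes.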

algorithm, condition number, gradient step, (14 more...)

Country: North America > United States > Nevada (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.70)

Stochastic Gradient Riemannian Langevin Dynamics on the Probability Simplex

Sam Patterson, Yee Whye Teh

Neural Information Processing Systems | Oct-3-2025, 08:19:08 GMT

Neural Information Processing Systems http://nips.cc/

probability simplex, stochastic gradient riemannian langevin dynamic

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)

Understanding Approximate Fisher Information for Fast Convergence of Natural Gradient Descent in Wide Neural Networks

Neural Information Processing Systems | Oct-3-2025, 08:08:43 GMT

The fast convergence holds in layer-wise approximations; for instance, in block diagonal approximation where each block corresponds to a layer as well as in block tri-diagonal and K-FAC approximations.

approximation, convergence, neural network, (15 more...)

Country:

North America > Canada (0.04)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.42)

Continuous-time Models for Stochastic Optimization Algorithms

Antonio Orvieto, Aurelien Lucchi

Neural Information Processing Systems | Oct-3-2025, 07:43:49 GMT

We propose new continuous-time formulations for first-order stochastic optimization algorithms such as mini-batch gradient descent and variance-reduced methods.

algorithm, arxiv preprint arxiv, convergence, (12 more...)

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Asia > Middle East > Jordan (0.05)
North America > Canada (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.84)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.37)

Appendix

Neural Information Processing Systems | Oct-3-2025, 07:24:01 GMT

This objective is amenable to minibatching. The variational posterior tracks the true posterior during gradient updates.

dataset, dun, prediction, (17 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems | Oct-3-2025, 05:41:06 GMT

First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. Line 33: I don't think it is accurate to attribute the recent success of supervised neural nets on various applications to BP and dropout. Firstly, learning nets with gradient descent has been around a long time, and the key to its recent success has mostly been fast computers/GPUs, a wealth of labelled data, advances in understanding of how to make SGD work well (e.g. Techniques like dropout have also been useful in reducing overfitting, but are hardly the key missing ingredient to make these systems work well. Line 37: The claim that the lacklustre results associated with unsupervised generative approaches are owed purely to their intractability issues is a strong and problematic one.

function space, generative model, optimization, (14 more...)

Country: North America > Canada > Quebec > Montreal (0.04)

Genre: Overview (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)
