Gradient Descent
Consolidating Reinforcement Learning for Multimodal Discrete Diffusion Models
Ma, Tianren, Zhang, Mu, Wang, Yibing, Ye, Qixiang
Optimizing discrete diffusion model (DDM) with rewards remains a challenge: the non-autoregressive paradigm makes importance sampling intractable and rollout complex, puzzling reinforcement learning methods such as Group Relative Policy Optimization (GRPO). In this study, we introduce MaskGRPO, the first viable approach to enable scalable multimodal reinforcement learning in discrete diffusion with effective importance sampling and modality-specific adaptations. To this end, we first clarify the theoretical foundation for DDMs, which facilitates building an importance estimator that captures valuable token fluctuation for gradient updates. We then delicately tailored the rollout method for visual sequences, which yields diverse completions and reliable optimization gradients. Upon math reasoning, coding, and visual generation benchmarks, MaskGRPO brings more stable and efficient updates, leading to stronger reasoning performance and better generation quality. This study establishes MaskGRPO as a systematic policy optimization approach and the first practical way for discretized visual diffusion.
Quantitative Convergence Analysis of Projected Stochastic Gradient Descent for Non-Convex Losses via the Goldstein Subdifferential
Zheng, Yuping, Lamperski, Andrew
Stochastic gradient descent (SGD) is the main algorithm behind a large body of work in machine learning. In many cases, constraints are enforced via projections, leading to projected stochastic gradient algorithms. In recent years, a large body of work has examined the convergence properties of projected SGD for non-convex losses in asymptotic and non-asymptotic settings. Strong quantitative guarantees are available for convergence measured via Moreau envelopes. However, these results cannot be compared directly with work on unconstrained SGD, since the Moreau envelope construction changes the gradient. Other common measures based on gradient mappings have the limitation that convergence can only be guaranteed if variance reduction methods, such as mini-batching, are employed. This paper presents an analysis of projected SGD for non-convex losses over compact convex sets. Convergence is measured via the distance of the gradient to the Goldstein subdifferential generated by the constraints. Our proposed convergence criterion directly reduces to commonly used criteria in the unconstrained case, and we obtain convergence without requiring variance reduction. We obtain results for data that are independent, identically distributed (IID) or satisfy mixing conditions ($L$-mixing). In these cases, we derive asymptotic convergence and $O(N^{-1/3})$ non-asymptotic bounds in expectation, where $N$ is the number of steps. In the case of IID sub-Gaussian data, we obtain almost-sure asymptotic convergence and high-probability non-asymptotic $O(N^{-1/5})$ bounds. In particular, these are the first non-asymptotic high-probability bounds for projected SGD with non-convex losses.
On the Global Convergence of (Fast) Incremental Expectation Maximization Methods
Belhal Karimi, Hoi-To Wai, Eric Moulines, Marc Lavielle
The EM algorithm is one of the most popular algorithm for inference in latent data models. The original formulation of the EM algorithm does not scale to large data set, because the whole data set is required at each iteration of the algorithm. To alleviate this problem, Neal and Hinton [1998] have proposed an incremental version of the EM (iEM) in which at each iteration the conditional expectation of the latent data (E-step) is updated only for a mini-batch of observations. Another approach has been proposed by Capp e and Moulines [2009] in which the E-step is replaced by a stochastic approximation step, closely related to stochastic gradient. In this paper, we analyze incremental and stochastic version of the EM algorithm as well as the variance reduced-version of [Chen et al., 2018] in a common unifying framework. We also introduce a new version incremental version, inspired by the SAGA algorithm by Defazio et al. [2014]. We establish non-asymptotic convergence bounds for global convergence. Numerical applications are presented in this article to illustrate our findings.
37f0e884fbad9667e38940169d0a3c95-Reviews.html
The optimal first-order algorithm of Nesterov has linear convergence for such problem but the constant depends on the square root of the condition number k. The authors consider the situation where one has access to the expensive full gradient of the objective as well as a cheap stochastic gradient oracle. They propose a hybrid algorithm which only requires O(log 1/eps) calls to the full gradient oracle (independent of the condition number) and O(k^2 log(1/eps)) calls to the cheaper stochastic gradient oracle -- as long as the condition number is not too big, this could be faster in theory. The main idea behind their algorithm(called Epoch Mixed Gradient Descent - EMGD) is to replace a full gradient step (called an epoch) with a fixed number O(k^2) of mixed gradient steps which use a combination of the full gradient (computed once for the epoch) and stochastic gradients (which vary within an epoch). By taking the average of the O(k^2) iterates within an epoch, they can show a constant decrease of the suboptimality *independent* of the condition number, which is why the number of required full gradient step computations (the number of epochs) is independent from the condition number. They provide a simple and complete self-contained proof of their convergence rate, but no experiment.
Export Reviews, Discussions, Author Feedback and Meta-Reviews
First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. Line 33: I don't think it is accurate to attribute the recent success of supervised neural nets on various applications to BP and dropout. Firstly, learning nets with gradient descent has been around a long time, and the key to its recent success has mostly been fast computers/GPUs, a wealth of labelled data, advances in understanding of how to make SGD work well (e.g. Techniques like dropout have also been useful in reducing overfitting, but are hardly the key missing ingredient to make these systems work well. Line 37: The claim that the lacklustre the results associated with unsupervised generative approaches is owed purely to their intractability issues is a strong and problematic one.