Gradient Descent
On the Theory of Continual Learning with Gradient Descent for Neural Networks
Taheri, Hossein, Ghosh, Avishek, Mazumdar, Arya
Gradient-based methods are the primary approach for training neural networks. In recent years, research in learning theory has shown that neural networks can efficiently learn various data classes using empirical risk minimization (ERM) methods. In many real-world settings, data arrive sequentially in a non-stationary manner, requiring the learner to maintain performance on past tasks while acquiring new capabilities. In such cases, a learning model must be continually learnable, meaning it should retain previously acquired knowledge when trained on new tasks. On the other hand, various learning systems, including deep learning architectures, can be prone to catastrophic forgetting, that is, updating a model on new data causes a dramatic drop in performance on previously learned tasks [McCloskey and Cohen, 1989, Goodfellow et al., 2013]. The goal of continual (lifelong) learning is to develop models and methods that, even without retraining on old data, experience minimal forgetting when incorporating new information. Despite deep learning's ubiquity, characterizing the power and limitations of neural networks is still an ongoing research direction. While several recent works aim to understand the power of gradient descent (GD) for training neural networks with stylized data distributions, these works are still limited to single-task scenarios (for some examples see [Du et al., 2019, Bartlett et al., 2021, Abbe et al., 2022]).
Correlating Cross-Iteration Noise for DP-SGD using Model Curvature
Gu, Xin, Xiao, Yingtai, He, Guanlin, Bai, Jiamu, Kifer, Daniel, Maeng, Kiwan
Differentially private stochastic gradient descent (DP-SGD) offers the promise of training deep learning models while mitigating many privacy risks. However, there is currently a large accuracy gap between DP-SGD and normal SGD training. This has resulted in different lines of research investigating orthogonal ways of improving privacy-preserving training. One such line of work, known as DP-MF, correlates the privacy noise across different iterations of stochastic gradient descent, allowing later iterations to cancel out some of the noise added to earlier iterations. In this paper, we study how to improve this noise correlation. We propose a technique called NoiseCurve that uses model curvature, estimated from public unlabeled data, to improve the quality of this cross-iteration noise correlation. Our experiments on various datasets, models, and privacy parameters show that the noise correlations computed by NoiseCurve offer consistent and significant improvements in accuracy over the correlation scheme used by DP-MF. Differential privacy (DP) (Dwork et al., 2006b) is a rigorous mathematical framework that limits the amount of personal information an attacker can infer from the output of an algorithm that processes confidential data. Differentially private stochastic gradient descent (DP-SGD; Abadi et al., 2016) is one of the most popular methods for training machine learning (ML) models with DP guarantees. DP-SGD differs from standard SGD in two important ways.
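As a rough illustration of the baseline being improved, the sketch below shows one step of DP-SGD in its commonly described form: each example's gradient is clipped to a fixed L2 norm, the clipped gradients are summed, and independent Gaussian noise is added before the update. The excerpt above is cut off before it names its two differences, so this follows the usual textbook formulation rather than this paper's text. The function names `clip` and `dp_sgd_step` and the parameters `max_norm` and `noise_multiplier` are illustrative assumptions. The sketch does not implement DP-MF or NoiseCurve, which correlate the noise across iterations instead of drawing it independently.

```python
import math
import random


def clip(grad, max_norm):
    """Scale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD step (illustrative): clip, sum, add Gaussian noise, average."""
    dim = len(params)
    total = [0.0] * dim
    for grad in per_example_grads:
        clipped = clip(grad, max_norm)
        for i in range(dim):
            total[i] += clipped[i]
    # Noise scale is tied to the clipping norm; noise is independent per step here.
    sigma = noise_multiplier * max_norm
    noisy = [total[i] + rng.gauss(0.0, sigma) for i in range(dim)]
    batch = len(per_example_grads)
    return [params[i] - lr * noisy[i] / batch for i in range(dim)]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4]]
    print(dp_sgd_step([0.0, 0.0], grads, lr=0.1, max_norm=1.0,
                      noise_multiplier=1.0, rng=rng))
```

Setting `noise_multiplier` to zero reduces the step to SGD with clipped gradients, which is a convenient way to check the clipping logic in isolation.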
Simultaneous Learning and Optimization via Misspecified Saddle Point Problems
Ahmadi, Mohammad Mahdi, Hamedani, Erfan Yazdandoost
We study a class of misspecified saddle point (SP) problems, where the optimization objective depends on an unknown parameter that must be learned concurrently from data. Unlike existing studies that assume parameters are fully known or pre-estimated, our framework integrates optimization and learning into a unified formulation, enabling a more flexible problem class. To address this setting, we propose two algorithms based on the accelerated primal-dual (APD) method of Hamedani & Aybat (2021). In particular, we first analyze the naive extension of the APD method, which directly substitutes the evolving parameter estimates into the primal-dual updates; then, we design a new learning-aware variant of the APD method that explicitly accounts for parameter dynamics by adjusting the momentum updates. Both methods achieve a provable convergence rate of $\mathcal{O}(\log K / K)$, while the learning-aware approach attains a tighter $\mathcal{O}(1)$ constant and further benefits from an adaptive step-size selection enabled by a backtracking strategy. Furthermore, we extend the framework to problems where the learning problem admits multiple optimal solutions, showing that our modified algorithm for a structured setting achieves an $\mathcal{O}(1/\sqrt{K})$ rate. To demonstrate practical impact, we evaluate our methods on a misspecified portfolio optimization problem and show superior empirical performance compared to state-of-the-art algorithms.
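The idea of substituting an evolving parameter estimate into primal-dual updates can be sketched on a toy problem. The example below uses plain gradient descent-ascent, not the APD method, and a hypothetical objective chosen only for illustration: L(x, y; theta) = 0.5 x^2 + y (x - theta) - 0.5 y^2, with theta learned by gradient steps on 0.5 (theta - c)^2. The function name `misspecified_gda`, the step sizes, and the objective are all assumptions, not taken from the paper. For a fixed theta the saddle point is x = theta / 2, y = -theta / 2.

```python
def misspecified_gda(c, steps=500, eta=0.1, alpha=0.2):
    """Toy simultaneous learning and saddle-point optimization (illustrative).

    Learning problem:  minimize 0.5 * (theta - c)^2 over theta.
    Saddle problem:    min over x, max over y of
                       0.5 * x^2 + y * (x - theta) - 0.5 * y^2.
    """
    x, y, theta = 0.0, 0.0, 0.0
    for _ in range(steps):
        # Learning step: move the parameter estimate toward its optimum c.
        theta = theta - alpha * (theta - c)
        # Primal-dual step using the current (still inexact) estimate of theta.
        grad_x = x + y
        grad_y = x - theta - y
        x, y = x - eta * grad_x, y + eta * grad_y
    return x, y, theta


if __name__ == "__main__":
    x, y, theta = misspecified_gda(c=4.0)
    print(x, y, theta)
```

Early iterations optimize against a wrong theta, which is the misspecification the abstract refers to; as the estimate converges, the iterates approach the saddle point of the correctly specified problem.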
MAPGD: Multi-Agent Prompt Gradient Descent for Collaborative Prompt Optimization
Han, Yichen, Han, Yuhang, Liu, Bojun, Zhou, Zhengpeng, Liu, Guanyu, Zhang, Zeng, Yang, Yang, Wang, Wenli, Shi, Isaac N, Zhang, Yunyan, He, Lewei, Shi, Tianyu
Prompt engineering is crucial for fully leveraging large language models (LLMs), yet most existing optimization methods follow a single trajectory, resulting in limited adaptability, gradient conflicts, and high computational overhead. We propose MAPGD (Multi-Agent Prompt Gradient Descent), a novel framework that reconceptualizes prompt optimization as a collaborative process among specialized agents. Each agent focuses on a distinct refinement dimension, such as instruction clarity, example selection, format structure, or stylistic adaptation, and their contributions are coordinated through semantic gradient embedding, conflict detection, and fusion. To further enhance robustness and stability, MAPGD introduces two new mechanisms: Hypersphere Constrained Gradient Clustering (HCGC), which enforces angular margin constraints for compact and well-separated clusters, and Channel Adaptive Agent Weighting (CAAW), which dynamically reweights agent contributions based on validation performance. Experiments on classification and reasoning benchmarks show that MAPGD consistently surpasses single-agent and random baselines in both accuracy and efficiency. Ablation studies confirm the effectiveness of gradient fusion, agent specialization, and conflict resolution. Together, these components establish MAPGD as a unified, gradient-based, and interpretable framework for robust prompt optimization with theoretical convergence guarantees.
Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training
Xiong, Wei, Ye, Chenlu, Liao, Baohao, Dong, Hanze, Xu, Xinxing, Monz, Christof, Bian, Jiang, Jiang, Nan, Zhang, Tong
Reinforcement learning applied to large language models (LLMs) for reasoning tasks is often bottlenecked by unstable gradient estimates due to fixed and uniform sampling of responses across prompts. Prior work such as GVM-RAFT addresses this by dynamically allocating inference budget per prompt to minimize stochastic gradient variance under a budget constraint. Inspired by this insight, we propose Reinforce-Ada, an adaptive sampling framework for online RL post-training of LLMs that continuously reallocates sampling effort to the prompts with the greatest uncertainty or learning potential. Unlike conventional two-stage allocation methods, Reinforce-Ada interleaves estimation and sampling in an online successive elimination process, and automatically stops sampling for a prompt once sufficient signal is collected. To stabilize updates, we form fixed-size groups with enforced reward diversity and compute advantage baselines using global statistics aggregated over the adaptive sampling phase. Empirical results across multiple model architectures and reasoning benchmarks show that Reinforce-Ada accelerates convergence and improves final performance compared to GRPO, especially when using the balanced sampling variant. Our work highlights the central role of variance-aware, adaptive data curation in enabling efficient and reliable reinforcement learning for reasoning-capable LLMs.
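The interleaving of sampling and stopping described above can be sketched as a round-based loop. This is a minimal illustration under stated assumptions, not the authors' algorithm: rewards are assumed binary (0 or 1), and a prompt is retired once it has produced at least one response of each kind, which is one plausible reading of "sufficient signal". The names `adaptive_sample`, `sample_reward`, `per_round`, and `max_rounds` are hypothetical.

```python
def adaptive_sample(prompts, sample_reward, per_round, max_rounds):
    """Sample rewards round by round and retire prompts that have enough signal.

    sample_reward(prompt) is a caller-supplied function returning 0 or 1.
    A prompt stays active until both a 0 and a 1 have been observed
    (an assumed stopping rule) or max_rounds is reached.
    """
    collected = {p: [] for p in prompts}
    active = list(prompts)
    for _ in range(max_rounds):
        if not active:
            break
        still_active = []
        for p in active:
            collected[p].extend(sample_reward(p) for _ in range(per_round))
            rewards = collected[p]
            if not (0 in rewards and 1 in rewards):
                still_active.append(p)
        active = still_active
    return collected


if __name__ == "__main__":
    import random
    rng = random.Random(0)
    success_rate = {"easy": 0.95, "medium": 0.5, "hard": 0.05}
    result = adaptive_sample(
        list(success_rate),
        lambda p: 1 if rng.random() < success_rate[p] else 0,
        per_round=4,
        max_rounds=8,
    )
    for prompt, rewards in result.items():
        print(prompt, len(rewards))
```

Under this rule, prompts with mixed outcomes stop early, while prompts that are nearly always solved or nearly always failed keep drawing samples until the round limit, so the budget concentrates where a contrasting response is still missing.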
Closed-Form Last Layer Optimization
Galashov, Alexandre, Da Costa, Nathaël, Xu, Liyuan, Hennig, Philipp, Gretton, Arthur
Neural networks are typically optimized with variants of stochastic gradient descent. Under a squared loss, however, the optimal solution for the linear last-layer weights is known in closed form. We propose to leverage this during optimization, treating the last layer as a function of the backbone parameters, and optimizing solely for these parameters. We show this is equivalent to alternating between gradient descent steps on the backbone and closed-form updates on the last layer. We adapt the method for the setting of stochastic gradient descent, by trading off the loss on the current batch against the accumulated information from previous batches. Further, we prove that, in the Neural Tangent Kernel regime, convergence of this method to an optimal solution is guaranteed. Finally, we demonstrate the effectiveness of our approach compared with standard SGD on a squared loss in several supervised tasks (both regression and classification), including Fourier Neural Operators and Instrumental Variable Regression.
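The closed-form step can be illustrated with a small least-squares solve. The sketch below assumes the backbone has already produced a feature vector for each example and computes the last-layer weights that minimize the squared loss on those features. The small ridge term is an assumption added here for numerical stability, and the names `solve` and `closed_form_last_layer` are illustrative. It covers only the full-batch last-layer update, not the stochastic variant or the backbone gradient steps described in the abstract.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def closed_form_last_layer(features, targets, ridge=1e-6):
    """Weights w minimizing sum_i (w . features[i] - targets[i])^2 + ridge * |w|^2.

    Solves the normal equations (F^T F + ridge * I) w = F^T t.
    """
    d = len(features[0])
    gram = [[sum(f[j] * f[k] for f in features) + (ridge if j == k else 0.0)
             for k in range(d)] for j in range(d)]
    rhs = [sum(f[j] * t for f, t in zip(features, targets)) for j in range(d)]
    return solve(gram, rhs)


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    targs = [2.0, 3.0, 5.0]
    print(closed_form_last_layer(feats, targs, ridge=0.0))
```

In a training loop of the kind the abstract describes, a solve like this would be recomputed after each change to the backbone, so that the last layer is always optimal for the current features.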