Gradient Descent
The Measure of Deception: An Analysis of Data Forging in Machine Unlearning
Dixit, Rishabh, Hui, Yuan, Saab, Rayan
Motivated by privacy regulations and the need to mitigate the effects of harmful data, machine unlearning seeks to modify trained models so that they effectively ``forget'' designated data. A key challenge in verifying unlearning is forging -- adversarially crafting data that mimics the gradient of a target point, thereby creating the appearance of unlearning without actually removing information. To capture this phenomenon, we consider the collection of data points whose gradients approximate a target gradient within tolerance $ฮต$ -- which we call an $ฮต$-forging set -- and develop a framework for its analysis. For linear regression and one-layer neural networks, we show that the Lebesgue measure of this set is small. It scales on the order of $ฮต$, and when $ฮต$ is small enough, $ฮต^d$. More generally, under mild regularity assumptions, we prove that the forging set measure decays as $ฮต^{(d-r)/2}$, where $d$ is the data dimension and $r
AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates
Zhang, Minxin, Liu, Yuxuan, Schaeffer, Hayden
The recently proposed Muon optimizer updates weight matrices via orthogonalized momentum and has demonstrated strong empirical success in large language model training. However, it remains unclear how to determine the learning rates for such orthogonalized updates. AdaGrad, by contrast, is a widely used adaptive method that scales stochastic gradients by accumulated past gradients. We propose a new algorithm, AdaGO, which combines a norm-based AdaGrad-type stepsize with an orthogonalized update direction, bringing together the benefits of both approaches. Unlike other adaptive variants of Muon, AdaGO preserves the orthogonality of the update direction, which can be interpreted as a spectral descent direction, while adapting the stepsizes to the optimization landscape by scaling the direction with accumulated past gradient norms. The implementation of AdaGO requires only minimal modification to Muon, with a single additional scalar variable, the accumulated squared gradient norms, to be computed, making it computationally and memory efficient. Optimal theoretical convergence rates are established for nonconvex functions in both stochastic and deterministic settings under standard smoothness and unbiased bounded-variance noise assumptions. Empirical results on CIFAR-10 classification and function regression demonstrate that AdaGO outperforms Muon and Adam.
DCPO: Dynamic Clipping Policy Optimization
Yang, Shihui, Dou, Chengfeng, Guo, Peidong, Lu, Kai, Ju, Qiang, Deng, Fei, Xin, Rihui
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising framework for enhancing the reasoning capabilities of large language models. However, existing approaches such as GRPO often suffer from zero gradients. This problem arises primarily due to fixed clipping bounds for token-level probability ratios and the standardization of identical rewards, which can lead to ineffective gradient updates and underutilization of generated responses. In this work, we propose Dynamic Clipping Policy Optimization(DCPO), which introduces a dynamic clipping strategy that adaptively adjusts clipping bounds based on token-specific prior probabilities to enhance token-level exploration, and a smooth advantage standardization technique that standardizes rewards across cumulative training steps to improve the response-level effective utilization of generated responses. DCPO achieved state-of-the-art performance on four benchmarks based on four different models. In particular, DCPO achieved an Avg@1 of 46.7 under greedy decoding and an Avg@32 of 38.8 under 32 times sampling on the AIME24 benchmark, surpassing DAPO (36.7/31.6), GRPO (36.7/32.1) and GSPO (40.0/34.9) on the Qwen2.5-Math-7B model. On the AIME25 benchmark based on Qwen2.5-14B, DCPO achieves a performance of (23.3/19.0), surpassing GRPO (13.3/10.5), DAPO (20.0/15.3) and GSPO (16.7/9.9). Furthermore, DCPO achieved an average 28% improvement in the nonzero advantage over GRPO in four models, doubled the training efficiency over DAPO, and significantly reduced the token clipping ratio by an order of magnitude compared to both GRPO and DAPO, while achieving superior performance. These results highlight DCPO's effectiveness in leveraging generated data more efficiently for reinforcement learning in large language models.
Integrating Intermediate Layer Optimization and Projected Gradient Descent for Solving Inverse Problems with Diffusion Models
Zheng, Yang, Li, Wen, Liu, Zhaoqiang
Inverse problems (IPs) involve reconstructing signals from noisy observations. Recently, diffusion models (DMs) have emerged as a powerful framework for solving IPs, achieving remarkable reconstruction performance. However, existing DM-based methods frequently encounter issues such as heavy computational demands and suboptimal convergence. In this work, building upon the idea of the recent work DMPlug, we propose two novel methods, DMILO and DMILO-PGD, to address these challenges. Our first method, DMILO, employs intermediate layer optimization (ILO) to alleviate the memory burden inherent in DMPlug. Additionally, by introducing sparse deviations, we expand the range of DMs, enabling the exploration of underlying signals that may lie outside the range of the diffusion model. We further propose DMILO-PGD, which integrates ILO with projected gradient descent (PGD), thereby reducing the risk of suboptimal convergence. We provide an intuitive theoretical analysis of our approaches under appropriate conditions and validate their superiority through extensive experiments on diverse image datasets, encompassing both linear and nonlinear IPs. Our results demonstrate significant performance gains over state-of-the-art methods, highlighting the effectiveness of DMILO and DMILO-PGD in addressing common challenges in DM-based IP solvers.
Mo' Memory, Mo' Problems: Stream-Native Machine Unlearning
Machine unlearning work assumes a static, i.i.d training environment that doesn't truly exist. Modern ML pipelines need to learn, unlearn, and predict continuously on production streams of data. We translate batch unlearning to the online setting using notions of regret, sample complexity, and deletion capacity. We tighten regret bounds to a logarithmic $\mathcal{O}(\ln{T})$, a first for a certified unlearning algorithm. When fitted with an online variant of L-BFGS optimization, the algorithm achieves state of the art regret with a constant memory footprint. Such changes extend the lifespan of an ML model before expensive retraining, making for a more efficient unlearning process.
Asynchronous and Stochastic Distributed Resource Allocation
Li, Qiang, Yemini, Michal, Wai, Hoi-To
This work proposes and studies the distributed resource allocation problem in asynchronous and stochastic settings. We consider a distributed system with multiple workers and a coordinating server with heterogeneous computation and communication times. We explore an approximate stochastic primal-dual approach with the aim of 1) adhering to the resource budget constraints, 2) allowing for the asynchronicity between the workers and the server, and 3) relying on the locally available stochastic gradients. We analyze our Asynchronous stochastic Primal-Dual (Asyn-PD) algorithm and prove its convergence in the second moment to the saddle point solution of the approximate problem at the rate of $O(1/t)$, where $t$ is the iteration number. Furthermore, we verify our algorithm numerically to validate the analytically derived convergence results, and demonstrate the advantages of utilizing our asynchronous algorithm rather than deploying a synchronous algorithm where the server must wait until it gets update from all workers.
Federated learning over physical channels: adaptive algorithms with near-optimal guarantees
In federated learning, communication cost can be significantly reduced by transmitting the information over the air through physical channels. In this paper, we propose a new class of adaptive federated stochastic gradient descent (SGD) algorithms that can be implemented over physical channels, taking into account both channel noise and hardware constraints. We establish theoretical guarantees for the proposed algorithms, demonstrating convergence rates that are adaptive to the stochastic gradient noise level. We also demonstrate the practical effectiveness of our algorithms through simulation studies with deep learning models.
Globally aware optimization with resurgence
Modern optimization faces a fundamental challenge: local gradient-based methods provide no global information about the objective function $L$ landscape, often leading to suboptimal convergence and sensitivity to initialization. We introduce a novel optimization framework that leverages resurgence theory from complex analysis to extract global structural information from divergent asymptotic series. Our key insight is that the factorially divergent perturbative expansions of parameter space partition functions encode precise information about all critical objective function value in the landscape through their Borel transform singularities. The algorithm works by computing the statistical mechanical partition function $Z(g) = \int e^{-L(ฮธ)/g} dฮธ$ for small coupling $g\ll 1$, extracting its asymptotic series coefficients, and identifying Borel plane singularities that correspond one-to-one with critical objective function values. These target values provide global guidance to local optimizers, enabling principled learning rate adaptation and escape from suboptimal regions. Unlike heuristic adaptive methods, targets are theoretically grounded in the geometry of the optimization landscape.
An Evolutionary Multi-objective Optimization for Replica-Exchange-based Physics-informed Operator Learning Network
Lu, Binghang, Mou, Changhong, Lin, Guang
In this paper, we propose an evolutionary Multi-objective Optimization for Replica-Exchange-based Physics-informed Operator learning Network, which is a novel operator learning network to efficiently solve parametric partial differential equations. In forward and inverse settings, this operator learning network only admits minimum requirement of noisy observational data. While physics-informed neural networks and operator learning approaches such as Deep Operator Networks and Fourier Neural Operators offer promising alternatives to traditional numerical solvers, they struggle with balancing operator and physics losses, maintaining robustness under noisy or sparse data, and providing uncertainty quantification. The proposed framework addresses these limitations by integrating: (i) evolutionary multi-objective optimization to adaptively balance operator and physics-based losses in the Pareto front; (ii) replica exchange stochastic gradient Langevin dynamics to improve global parameter-space exploration and accelerate convergence; and (iii) built-in Bayesian uncertainty quantification from stochastic sampling. The proposed operator learning method is tested numerically on several different problems including one-dimensional Burgers equation and the time-fractional mixed diffusion-wave equation. The results indicate that our framework consistently outperforms the general operator learning methods in accuracy, noise robustness, and the ability to quantify uncertainty.
A Fluid Antenna Enabled Physical Layer Key Generation for Next-G Wireless Networks
Guo, Jiacheng, Gao, Ning, Zuo, Yiping, Xu, Hao, Jin, Shi, Wong, Kai Kit
As a promising physical layer security technique, physical layer key generation (PLKG) enables legitimate users to obtain secret keys from wireless channel without security infrastructures. However, in harsh propagation environments, the channel characteristic becomes unsatisfactory, the key generation rate (KGR) is significantly deteriorated. In this paper, we propose a novel fluid antenna (FA) enabled PLKG system to address this challenge. Specifically, we first derive the closed-form expression of the KGR for FA array, and then jointly optimize the precoding matrix and the antenna positions via a particle swarm optimization (PSO) algorithm. Next, to further reduce the computational complexity of the optimization procedure, we develop an alternating optimization (AO) algorithm, which combines the projected gradient descent (PGD) and the PSO. Simulation results demonstrate that by exploiting the additional spatial degree of freedom (DoF), our FA enabled PLKG system is superior to the benchmarks, such as the conventional fixed-position antenna (FPA) array and the reconfigurable intelligent surface (RIS). It is worth highlighting that compared to the conventional uniform planar antenna (UPA), the FA enabled PLKG achieves a 35.42\% KGR performance improvement under PSO algorithm and a 67.73\% KGR performance improvement under AO algorithm, respectively.