Gradient Descent
Neuro-Argumentative Learning with Case-Based Reasoning
We introduce Gradual Abstract Argumentation for Case-Based Reasoning (Gradual AA-CBR), a data-driven, neurosymbolic classification model in which the outcome is determined by an argumentation debate structure that is learned simultaneously with neural-based feature extractors. Each argument in the debate is an observed case from the training data, favouring their labelling. Cases attack or support those with opposing or agreeing labellings, with the strength of each argument and relationship learned through gradient-based methods. This argumentation debate structure provides human-aligned reasoning, improving model interpretability compared to traditional neural networks (NNs). Unlike the existing purely symbolic variant, Abstract Argumentation for Case-Based Reasoning (AA-CBR), Gradual AA-CBR is capable of multi-class classification, automatic learning of feature and data point importance, assigning uncertainty values to outcomes, using all available data points, and does not require binary features. We show that Gradual AA-CBR performs comparably to NNs whilst significantly outperforming existing AA-CBR formulations.
Mechanistic Insights into Grokking from the Embedding Layer
AlquBoj, H. V., AlQuabeh, Hilal, Bojkovic, Velibor, Nwadike, Munachiso, Inui, Kentaro
Grokking, a delayed generalization in neural networks after perfect training performance, has been observed in Transformers and MLPs, but the components driving it remain underexplored. We show that embeddings are central to grokking: introducing them into MLPs induces delayed generalization in modular arithmetic tasks, whereas MLPs without embeddings can generalize immediately. Our analysis identifies two key mechanisms: (1) Embedding update dynamics, where rare tokens stagnate due to sparse gradient updates and weight decay, and (2) Bilinear coupling, where the interaction between embeddings and downstream weights introduces saddle points and increases sensitivity to initialization. To confirm these mechanisms, we investigate frequency-aware sampling, which balances token updates by minimizing gradient variance, and embedding-specific learning rates, derived from the asymmetric curvature of the bilinear loss landscape. We prove that an adaptive learning rate ratio, \(\frac{η_E}{η_W} \propto \frac{σ_{\max}(E)}{σ_{\max}(W)} \cdot \frac{f_W}{f_E}\), mitigates bilinear coupling effects, accelerating convergence. Our methods not only improve grokking dynamics but also extend to broader challenges in Transformer optimization, where bilinear interactions hinder efficient training.
AM-PPO: (Advantage) Alpha-Modulation with Proximal Policy Optimization
Proximal Policy Optimization (PPO) is a widely used reinforcement learning algorithm that heavily relies on accurate advantage estimates for stable and efficient training. However, raw advantage signals can exhibit significant variance, noise, and scale-related issues, impeding optimal learning performance. To address this challenge, we introduce Advantage Modulation PPO (AM-PPO), a novel enhancement of PPO that adaptively modulates advantage estimates using a dynamic, non-linear scaling mechanism. This adaptive modulation employs an alpha controller that dynamically adjusts the scaling factor based on evolving statistical properties of the advantage signals, such as their norm, variance, and a predefined target saturation level. By incorporating a tanh-based gating function driven by these adaptively scaled advantages, AM-PPO reshapes the advantage signals to stabilize gradient updates and improve the conditioning of the policy gradient landscape. Crucially, this modulation also influences value function training by providing consistent and adaptively conditioned learning targets. Empirical evaluations across standard continuous control benchmarks demonstrate that AM-PPO achieves superior reward trajectories, exhibits sustained learning progression, and significantly reduces the clipping required by adaptive optimizers. These findings underscore the potential of advantage modulation as a broadly applicable technique for enhancing reinforcement learning optimization.
A Unified Gradient-based Framework for Task-agnostic Continual Learning-Unlearning
Huang, Zhehao, Cheng, Xinwen, Zhang, Jie, Zheng, Jinghao, Wang, Haoran, He, Zhengbao, Li, Tao, Huang, Xiaolin
Recent advancements in deep models have highlighted the need for intelligent systems that combine continual learning (CL) for knowledge acquisition with machine unlearning (MU) for data removal, forming the Continual Learning-Unlearning (CLU) paradigm. While existing work treats CL and MU as separate processes, we reveal their intrinsic connection through a unified optimization framework based on Kullback-Leibler divergence minimization. This framework decomposes gradient updates for approximate CLU into four components: learning new knowledge, unlearning targeted data, preserving existing knowledge, and modulation via weight saliency. A critical challenge lies in balancing knowledge update and retention during sequential learning-unlearning cycles. To resolve this stability-plasticity dilemma, we introduce a remain-preserved manifold constraint to induce a remaining Hessian compensation for CLU iterations. A fast-slow weight adaptation mechanism is designed to efficiently approximate the second-order optimization direction, combined with adaptive weighting coefficients and a balanced weight saliency mask, proposing a unified implementation framework for gradient-based CLU. Furthermore, we pioneer task-agnostic CLU scenarios that support fine-grained unlearning at the cross-task category and random sample levels beyond the traditional task-aware setups. Experiments demonstrate that the proposed UG-CLU framework effectively coordinates incremental learning, precise unlearning, and knowledge stability across multiple datasets and model architectures, providing a theoretical foundation and methodological support for dynamic, compliant intelligent systems.
KO: Kinetics-inspired Neural Optimizer with PDE Simulation Approaches
Feng, Mingquan, Huang, Yixin, Fu, Yifan, Wang, Shaobo, Yan, Junchi
The design of optimization algorithms for neural networks remains a critical challenge, with most existing methods relying on heuristic adaptations of gradient-based approaches. This paper introduces KO (Kinetics-inspired Optimizer), a novel neural optimizer inspired by kinetic theory and partial differential equation (PDE) simulations. We reimagine the training dynamics of network parameters as the evolution of a particle system governed by kinetic principles, where parameter updates are simulated via a numerical scheme for the Boltzmann transport equation (BTE) that models stochastic particle collisions. This physics-driven approach inherently promotes parameter diversity during optimization, mitigating the phenomenon of parameter condensation, i.e. collapse of network parameters into low-dimensional subspaces, through mechanisms analogous to thermal diffusion in physical systems. We analyze this property, establishing both a mathematical proof and a physical interpretation. Extensive experiments on image classification (CIFAR-10/100, ImageNet) and text classification (IMDB, Snips) tasks demonstrate that KO consistently outperforms baseline optimizers (e.g., Adam, SGD), achieving accuracy improvements while computation cost remains comparable.
From stability of Langevin diffusion to convergence of proximal MCMC for non-log-concave sampling
Renaud, Marien, De Bortoli, Valentin, Leclaire, Arthur, Papadakis, Nicolas
We consider the problem of sampling distributions stemming from non-convex potentials with Unadjusted Langevin Algorithm (ULA). We prove the stability of the discrete-time ULA to drift approximations under the assumption that the potential is strongly convex at infinity. In many context, e.g. imaging inverse problems, potentials are non-convex and non-smooth. Proximal Stochastic Gradient Langevin Algorithm (PSGLA) is a popular algorithm to handle such potentials. It combines the forward-backward optimization algorithm with a ULA step. Our main stability result combined with properties of the Moreau envelope allows us to derive the first proof of convergence of the PSGLA for non-convex potentials. We empirically validate our methodology on synthetic data and in the context of imaging inverse problems. In particular, we observe that PSGLA exhibits faster convergence rates than Stochastic Gradient Langevin Algorithm for posterior sampling while preserving its restoration properties.
Private Statistical Estimation via Truncation
Zampetakis, Manolis, Zhou, Felix
We introduce a novel framework for differentially private (DP) statistical estimation via data truncation, addressing a key challenge in DP estimation when the data support is unbounded. Traditional approaches rely on problem-specific sensitivity analysis, limiting their applicability. By leveraging techniques from truncated statistics, we develop computationally efficient DP estimators for exponential family distributions, including Gaussian mean and covariance estimation, achieving near-optimal sample complexity. Previous works on exponential families only consider bounded or one-dimensional families. Our approach mitigates sensitivity through truncation while carefully correcting for the introduced bias using maximum likelihood estimation and DP stochastic gradient descent. Along the way, we establish improved uniform convergence guarantees for the log-likelihood function of exponential families, which may be of independent interest. Our results provide a general blueprint for DP algorithm design via truncated statistics.
Reconstructing Physics-Informed Machine Learning for Traffic Flow Modeling: a Multi-Gradient Descent and Pareto Learning Approach
Lei, Yuan-Zheng, Gong, Yaobang, Chen, Dianwei, Cheng, Yao, Yang, Xianfeng Terry
Physics-informed machine learning (PIML) is crucial in modern traffic flow modeling because it combines the benefits of both physics-based and data-driven approaches. In conventional PIML, physical information is typically incorporated by constructing a hybrid loss function that combines data-driven loss and physics loss through linear scalarization. The goal is to find a trade-off between these two objectives to improve the accuracy of model predictions. However, from a mathematical perspective, linear scalarization is limited to identifying only the convex region of the Pareto front, as it treats data-driven and physics losses as separate objectives. Given that most PIML loss functions are non-convex, linear scalarization restricts the achievable trade-off solutions. Moreover, tuning the weighting coefficients for the two loss components can be both time-consuming and computationally challenging. To address these limitations, this paper introduces a paradigm shift in PIML by reformulating the training process as a multi-objective optimization problem, treating data-driven loss and physics loss independently. We apply several multi-gradient descent algorithms (MGDAs), including traditional multi-gradient descent (TMGD) and dual cone gradient descent (DCGD), to explore the Pareto front in this multi-objective setting. These methods are evaluated on both macroscopic and microscopic traffic flow models. In the macroscopic case, MGDAs achieved comparable performance to traditional linear scalarization methods. Notably, in the microscopic case, MGDAs significantly outperformed their scalarization-based counterparts, demonstrating the advantages of a multi-objective optimization approach in complex PIML scenarios.
Smoothed SGD for quantiles: Bahadur representation and Gaussian approximation
Chen, Likai, Keilbar, Georg, Wu, Wei Biao
This paper considers the estimation of quantiles via a smoothed version of the stochastic gradient descent (SGD) algorithm. By smoothing the score function in the conventional SGD quantile algorithm, we achieve monotonicity in the quantile level in that the estimated quantile curves do not cross. We derive non-asymptotic tail probability bounds for the smoothed SGD quantile estimate both for the case with and without Polyak-Ruppert averaging. For the latter, we also provide a uniform Bahadur representation and a resulting Gaussian approximation result. Numerical studies show good finite sample behavior for our theoretical results.
A Local Polyak-Lojasiewicz and Descent Lemma of Gradient Descent For Overparametrized Linear Models
Xu, Ziqing, Min, Hancheng, Tarmoun, Salma, Mallada, Enrique, Vidal, Rene
Most prior work on the convergence of gradient descent (GD) for overparameterized neural networks relies on strong assumptions on the step size (infinitesimal), the hidden-layer width (infinite), or the initialization (large, spectral, balanced). Recent efforts to relax these assumptions focus on two-layer linear networks trained with the squared loss. In this work, we derive a linear convergence rate for training two-layer linear neural networks with GD for general losses and under relaxed assumptions on the step size, width, and initialization. A key challenge in deriving this result is that classical ingredients for deriving convergence rates for nonconvex problems, such as the Polyak-Łojasiewicz (PL) condition and Descent Lemma, do not hold globally for overparameterized neural networks. Here, we prove that these two conditions hold locally with local constants that depend on the weights. Then, we provide bounds on these local constants, which depend on the initialization of the weights, the current loss, and the global PL and smoothness constants of the non-overparameterized model. Based on these bounds, we derive a linear convergence rate for GD. Our convergence analysis not only improves upon prior results but also suggests a better choice for the step size, as verified through our numerical experiments.