Gradient Descent
Convergence Properties of Natural Gradient Descent for Minimizing KL Divergence
The Kullback-Leibler (KL) divergence plays a central role in probabilistic machine learning, where it commonly serves as the canonical loss function. Optimization in such settings is often performed over the probability simplex, where the choice of parameterization significantly impacts convergence. In this work, we study the problem of minimizing the KL divergence and analyze the behavior of gradient-based optimization algorithms under two dual coordinate systems within the framework of information geometry$-$ the exponential family ($θ$ coordinates) and the mixture family ($η$ coordinates). We compare Euclidean gradient descent (GD) in these coordinates with the coordinate-invariant natural gradient descent (NGD), where the natural gradient is a Riemannian gradient that incorporates the intrinsic geometry of the underlying statistical model. In continuous time, we prove that the convergence rates of GD in the $θ$ and $η$ coordinates provide lower and upper bounds, respectively, on the convergence rate of NGD. Moreover, under affine reparameterizations of the dual coordinates, the convergence rates of GD in $η$ and $θ$ coordinates can be scaled to $2c$ and $\frac{2}{c}$, respectively, for any $c>0$, while NGD maintains a fixed convergence rate of $2$, remaining invariant to such transformations and sandwiched between them. Although this suggests that NGD may not exhibit uniformly superior convergence in continuous time, we demonstrate that its advantages become pronounced in discrete time, where it achieves faster convergence and greater robustness to noise, outperforming GD. Our analysis hinges on bounding the spectrum and condition number of the Hessian of the KL divergence at the optimum, which coincides with the Fisher information matrix.
Feed-anywhere ANN (I) Steady Discrete $\to$ Diffusing on Graph Hidden States
Pasechnyuk-Vilensky, Dmitry, Doroshenko, Daniil
We propose a novel framework for learning hidden graph structures from data using geometric analysis and nonlinear dynamics. Our approach: (1) Defines discrete Sobolev spaces on graphs for scalar/vector fields, establishing key functional properties; (2) Introduces gauge-equivalent nonlinear Schrödinger and Landau--Lifshitz dynamics with provable stable stationary solutions smoothly dependent on input data and graph weights; (3) Develops a stochastic gradient algorithm over graph moduli spaces with sparsity regularization. Theoretically, we guarantee: topological correctness (homology recovery), metric convergence (Gromov--Hausdorff), and efficient search space utilization. Our dynamics-based model achieves stronger generalization bounds than standard neural networks, with complexity dependent on the data manifold's topology.
Statistical Inference for Differentially Private Stochastic Gradient Descent
Xia, Xintao, Zhang, Linjun, Cai, Zhanrui
Privacy preservation in machine learning, particularly through Differentially Private Stochastic Gradient Descent (DP-SGD), is critical for sensitive data analysis. However, existing statistical inference methods for SGD predominantly focus on cyclic subsampling, while DP-SGD requires randomized subsampling. This paper first bridges this gap by establishing the asymptotic properties of SGD under the randomized rule and extending these results to DP-SGD. For the output of DP-SGD, we show that the asymptotic variance decomposes into statistical, sampling, and privacy-induced components. Two methods are proposed for constructing valid confidence intervals: the plug-in method and the random scaling method. We also perform extensive numerical analysis, which shows that the proposed confidence intervals achieve nominal coverage rates while maintaining privacy.
Fast Last-Iterate Convergence of SGD in the Smooth Interpolation Regime
Attia, Amit, Schliserman, Matan, Sherman, Uri, Koren, Tomer
We study population convergence guarantees of stochastic gradient descent (SGD) for smooth convex objectives in the interpolation regime, where the noise at optimum is zero or near zero. The behavior of the last iterate of SGD in this setting -- particularly with large (constant) stepsizes -- has received growing attention in recent years due to implications for the training of over-parameterized models, as well as to analyzing forgetting in continual learning and to understanding the convergence of the randomized Kaczmarz method for solving linear systems. We establish that after $T$ steps of SGD on $β$-smooth convex loss functions with stepsize $0 < η< 2/β$, the last iterate exhibits expected excess risk $\widetilde{O}(\frac{1}{η(2-βη) T^{1-βη/2}} + \fracη{(2-βη)^2} T^{βη/2} σ_\star^2)$, where $σ_\star^2$ denotes the variance of the stochastic gradients at the optimum. In particular, for a well-tuned stepsize we obtain a near optimal $\widetilde{O}(1/T + σ_\star/\sqrt{T})$ rate for the last iterate, extending the results of Varre et al. (2021) beyond least squares regression; and when $σ_\star=0$ we obtain a rate of $\smash{O(1/\sqrt T)}$ with $η=1/β$, improving upon the best-known $\smash{O(T^{-1/4})}$ rate recently established by Evron et al. (2025) in the special case of realizable linear regression.
Who Owns This Sample: Cross-Client Membership Inference Attack in Federated Graph Neural Networks
Li, Kunhao, Wu, Di, Bai, Jun, Xu, Jing, Yang, Lei, Zhang, Ziyi, Song, Yiliao, Yang, Wencheng, Cai, Taotao, Li, Yan
Graph-structured data is prevalent in many real-world applications, including social networks, financial systems, and molecular biology. Graph Neural Networks (GNNs) have become the de facto standard for learning from such data due to their strong representation capabilities. As GNNs are increasingly deployed in federated learning (FL) settings to preserve data locality and privacy, new privacy threats arise from the interaction between graph structures and decentralized training. In this paper, we present the first systematic study of cross-client membership inference attacks (CC-MIA) against node classification tasks of federated GNNs (FedGNNs), where a malicious client aims to infer which client owns the given data. Unlike prior centralized-focused work that focuses on whether a sample was included in training, our attack targets sample-to-client attribution, a finer-grained privacy risk unique to federated settings. We design a general attack framework that exploits FedGNNs' aggregation behaviors, gradient updates, and embedding proximity to link samples to their source clients across training rounds. We evaluate our attack across multiple graph datasets under realistic FL setups. Results show that our method achieves high performance on both membership inference and ownership identification. Our findings highlight a new privacy threat in federated graph learning-client identity leakage through structural and model-level cues, motivating the need for attribution-robust GNN design.
On the Convergence of Gradient Descent on Learning Transformers with Residual Connections
Qin, Zhen, Zhou, Jinxin, Zhu, Zhihui
Transformer models have emerged as fundamental tools across various scientific and engineering disciplines, owing to their outstanding performance in diverse applications. Despite this empirical success, the theoretical foundations of Transformers remain relatively underdeveloped, particularly in understanding their training dynamics. Existing research predominantly examines isolated components--such as self-attention mechanisms and feedforward networks--without thoroughly investigating the interdependencies between these components, especially when residual connections are present. In this paper, we aim to bridge this gap by analyzing the convergence behavior of a structurally complete yet single-layer Transformer, comprising self-attention, a feedforward network, and residual connections. We demonstrate that, under appropriate initialization, gradient descent exhibits a linear convergence rate, where the convergence speed is determined by the minimum and maximum singular values of the output matrix from the attention layer. Moreover, our analysis reveals that residual connections serve to ameliorate the ill-conditioning of this output matrix, an issue stemming from the low-rank structure imposed by the softmax operation, thereby promoting enhanced optimization stability. We also extend our theoretical findings to a multi-layer Transformer architecture, confirming the linear convergence rate of gradient descent under suitable initialization. Empirical results corroborate our theoretical insights, illustrating the beneficial role of residual connections in promoting convergence stability.
MAP Estimation with Denoisers: Convergence Rates and Guarantees
Pesme, Scott, Meanti, Giacomo, Arbel, Michael, Mairal, Julien
Denoiser models have become powerful tools for inverse problems, enabling the use of pretrained networks to approximate the score of a smoothed prior distribution. These models are often used in heuristic iterative schemes aimed at solving Maximum a Posteriori (MAP) optimisation problems, where the proximal operator of the negative log-prior plays a central role. In practice, this operator is intractable, and practitioners plug in a pretrained denoiser as a surrogate-despite the lack of general theoretical justification for this substitution. In this work, we show that a simple algorithm, closely related to several used in practice, provably converges to the proximal operator under a log-concavity assumption on the prior $p$. We show that this algorithm can be interpreted as a gradient descent on smoothed proximal objectives. Our analysis thus provides a theoretical foundation for a class of empirically successful but previously heuristic methods.
Statistical and Algorithmic Foundations of Reinforcement Learning
Chi, Yuejie, Chen, Yuxin, Wei, Yuting
As a paradigm for sequential decision making in unknown environments, reinforcement learning (RL) has received a flurry of attention in recent years. However, the explosion of model complexity in emerging applications and the presence of nonconvexity exacerbate the challenge of achieving efficient RL in sample-starved situations, where data collection is expensive, time-consuming, or even high-stakes (e.g., in clinical trials, autonomous systems, and online advertising). How to understand and enhance the sample and computational efficacies of RL algorithms is thus of great interest. In this tutorial, we aim to introduce several important algorithmic and theoretical developments in RL, highlighting the connections between new ideas and classical topics. Employing Markov Decision Processes as the central mathematical model, we cover several distinctive RL scenarios (i.e., RL with a simulator, online RL, offline RL, robust RL, and RL with human feedback), and present several mainstream RL approaches (i.e., model-based approach, value-based approach, and policy optimization). Our discussions gravitate around the issues of sample complexity, computational efficiency, as well as algorithm-dependent and information-theoretic lower bounds from a non-asymptotic viewpoint.
Branching Stein Variational Gradient Descent for sampling multimodal distributions
Bañales, Isaías, Jaramillo, Arturo, Ricalde-Guerrero, Joshué Helí
We propose a novel particle-based variational inference method designed to work with multimodal distributions. Our approach, referred to as Branched Stein Variational Gradient Descent (BSVGD), extends the classical Stein Variational Gradient Descent (SVGD) algorithm by incorporating a random branching mechanism that encourages the exploration of the state space. In this work, a theoretical guarantee for the convergence in distribution is presented, as well as numerical experiments to validate the suitability of our algorithm. Performance comparisons between the BSVGD and the SVGD are presented using the Wasserstein distance between samples and the corresponding computational times.
On the Effectiveness of the z-Transform Method in Quadratic Optimization
Characterizing the convergence of real-valued or vector-v alued sequences is a key theoretical problem in data science, where the sequence index typically correspon ds to the number of iterations of an iterative algorithm (such as in optimization and signal processing) o r the number of observations (as in statistics and machine learning). This characterization can be done in mostly two ways, asymptotically or non-asymptotically. In an asymptotic analysis, an asymptotic e quivalent of the sequence is identified, which readily allows comparisons with other algorithms; however, without further analysis, the behavior at any finite time cannot be controlled. This is exactly what non-as ymptotic analysis aims to achieve, by providing bounds that are valid even for a finite index, but then only pro viding bounds that cannot always be compared. While the two approaches have their own merits, in this paper, we focus on asymptotic analysis and sequences that tend to their limit at a sub-exponential r ate that is a power of the sequence index. The main goal of this paper is to show how a classical tool from signal processing, control theory, and electrical engineering ( Oppenheim et al., 1996), the z -transform method ( Jury, 1964), can be used in this context with a striking efficiency at obtaining asymptotic eq uivalents for the class of algorithms that can be seen as iterations of potentially random linear operators i n a Hilbert space. This includes gradient descent for quadratic optimization problems as well as its accelera ted and stochastic variants ( Nesterov, 2018), 1 Landweber iterations in inverse problems ( Benning and Burger, 2018), or gossip algorithms in distributed computing ( Boyd et al., 2006).