Gradient Descent
Efficient Deconvolution in Populational Inverse Problems
Vadeboncoeur, Arnaud, Girolami, Mark, Stuart, Andrew M.
This work is focussed on the inversion task of inferring the distribution over parameters of interest leading to multiple sets of observations. The potential to solve such distributional inversion problems is driven by increasing availability of data, but a major roadblock is blind deconvolution, arising when the observational noise distribution is unknown. However, when data originates from collections of physical systems, a population, it is possible to leverage this information to perform deconvolution. To this end, we propose a methodology leveraging large data sets of observations, collected from different instantiations of the same physical processes, to simultaneously deconvolve the data corrupting noise distribution, and to identify the distribution over model parameters defining the physical processes. A parameter-dependent mathematical model of the physical process is employed. A loss function characterizing the match between the observed data and the output of the mathematical model is defined; it is minimized as a function of the both the parameter inputs to the model of the physics and the parameterized observational noise. This coupled problem is addressed with a modified gradient descent algorithm that leverages specific structure in the noise model. Furthermore, a new active learning scheme is proposed, based on adaptive empirical measures, to train a surrogate model to be accurate in parameter regions of interest; this approach accelerates computation and enables automatic differentiation of black-box, potentially nondifferentiable, code computing parameter-to-solution maps. The proposed methodology is demonstrated on porous medium flow, damped elastodynamics, and simplified models of atmospheric dynamics.
Multiple Wasserstein Gradient Descent Algorithm for Multi-Objective Distributional Optimization
Nguyen, Dai Hai, Mamitsuka, Hiroshi, Nakamura, Atsuyoshi
We address the optimization problem of simultaneously minimizing multiple objective functionals over a family of probability distributions. This type of Multi-Objective Distributional Optimization commonly arises in machine learning and statistics, with applications in areas such as multiple target sampling, multi-task learning, and multi-objective generative modeling. To solve this problem, we propose an iterative particle-based algorithm, which we call Muliple Wasserstein Gradient Descent (MWGraD), which constructs a flow of intermediate empirical distributions, each being represented by a set of particles, which gradually minimize the multiple objective functionals simultaneously. Specifically, MWGraD consists of two key steps at each iteration. First, it estimates the Wasserstein gradient for each objective functional based on the current particles. Then, it aggregates these gradients into a single Wasserstein gradient using dynamically adjusted weights and updates the particles accordingly. In addition, we provide theoretical analysis and present experimental results on both synthetic and real-world datasets, demonstrating the effectiveness of MWGraD.
Convergence, Sticking and Escape: Stochastic Dynamics Near Critical Points in SGD
Dudukalov, Dmitry, Logachov, Artem, Lotov, Vladimir, Prasolov, Timofei, Prokopenko, Evgeny, Tarasenko, Anton
We study the convergence properties and escape dynamics of Stochastic Gradient Descent (SGD) in one-dimensional landscapes, separately considering infinite- and finite-variance noise. Our main focus is to identify the time scales on which SGD reliably moves from an initial point to the local minimum in the same ''basin''. Under suitable conditions on the noise distribution, we prove that SGD converges to the basin's minimum unless the initial point lies too close to a local maximum. In that near-maximum scenario, we show that SGD can linger for a long time in its neighborhood. For initial points near a ''sharp'' maximum, we show that SGD does not remain stuck there, and we provide results to estimate the probability that it will reach each of the two neighboring minima. Overall, our findings present a nuanced view of SGD's transitions between local maxima and minima, influenced by both noise characteristics and the underlying function geometry.
Scaling Laws for Gradient Descent and Sign Descent for Linear Bigram Models under Zipf's Law
Kunstner, Frederik, Bach, Francis
Recent works have highlighted optimization difficulties faced by gradient descent in training the first and last layers of transformer-based language models, which are overcome by optimizers such as Adam. These works suggest that the difficulty is linked to the heavy-tailed distribution of words in text data, where the frequency of the $k$th most frequent word $π_k$ is proportional to $1/k$, following Zipf's law. To better understand the impact of the data distribution on training performance, we study a linear bigram model for next-token prediction when the tokens follow a power law $π_k \propto 1/k^α$ parameterized by the exponent $α> 0$. We derive optimization scaling laws for deterministic gradient descent and sign descent as a proxy for Adam as a function of the exponent $α$. Existing theoretical investigations in scaling laws assume that the eigenvalues of the data decay as a power law with exponent $α> 1$. This assumption effectively makes the problem ``finite dimensional'' as most of the loss comes from a few of the largest eigencomponents. In comparison, we show that the problem is more difficult when the data have heavier tails. The case $α= 1$ as found in text data is ``worst-case'' for gradient descent, in that the number of iterations required to reach a small relative error scales almost linearly with dimension. While the performance of sign descent also depends on the dimension, for Zipf-distributed data the number of iterations scales only with the square-root of the dimension, leading to a large improvement for large vocabularies.
Efficient Data Selection at Scale via Influence Distillation
Nikdan, Mahdi, Cohen-Addad, Vincent, Alistarh, Dan, Mirrokni, Vahab
Effective data selection is critical for efficient training of modern Large Language Models (LLMs). This paper introduces Influence Distillation, a novel, mathematically-justified framework for data selection that employs second-order information to optimally weight training samples. By distilling each sample's influence on a target distribution, our method assigns model-specific weights that are used to select training data for LLM fine-tuning, guiding it toward strong performance on the target domain. We derive these optimal weights for both Gradient Descent and Adam optimizers. To ensure scalability and reduce computational cost, we propose a $\textit{landmark-based approximation}$: influence is precisely computed for a small subset of "landmark" samples and then efficiently propagated to all other samples to determine their weights. We validate Influence Distillation by applying it to instruction tuning on the Tulu V2 dataset, targeting a range of tasks including GSM8k, SQuAD, and MMLU, across several models from the Llama and Qwen families. Experiments show that Influence Distillation matches or outperforms state-of-the-art performance while achieving up to $3.5\times$ faster selection.
Leveraging Per-Instance Privacy for Machine Unlearning
Sepahvand, Nazanin Mohammadi, Thudi, Anvith, Isik, Berivan, Bhattacharyya, Ashmita, Papernot, Nicolas, Triantafillou, Eleni, Roy, Daniel M., Dziugaite, Gintare Karolina
We present a principled, per-instance approach to quantifying the difficulty of unlearning via fine-tuning. We begin by sharpening an analysis of noisy gradient descent for unlearning (Chien et al., 2024), obtaining a better utility-unlearning tradeoff by replacing worst-case privacy loss bounds with per-instance privacy losses (Thudi et al., 2024), each of which bounds the (Renyi) divergence to retraining without an individual data point. To demonstrate the practical applicability of our theory, we present empirical results showing that our theoretical predictions are born out both for Stochastic Gradient Langevin Dynamics (SGLD) as well as for standard fine-tuning without explicit noise. We further demonstrate that per-instance privacy losses correlate well with several existing data difficulty metrics, while also identifying harder groups of data points, and introduce novel evaluation methods based on loss barriers. All together, our findings provide a foundation for more efficient and adaptive unlearning strategies tailored to the unique properties of individual data points.
FedHL: Federated Learning for Heterogeneous Low-Rank Adaptation via Unbiased Aggregation
Peng, Zihao, Zeng, Jiandian, Li, Boyuan, Li, Guo, Chen, Shengbo, Wang, Tian
--Federated Learning (FL) facilitates the fine-tuning of Foundation Models (FMs) using distributed data sources, with Low-Rank Adaptation (LoRA) gaining popularity due to its low communication costs and strong performance. While recent work acknowledges the benefits of heterogeneous LoRA in FL and introduces flexible algorithms to support its implementation, our theoretical analysis reveals a critical gap: existing methods lack formal convergence guarantees due to parameter truncation and biased gradient updates. Specifically, adapting client-specific LoRA ranks necessitates truncating global parameters, which introduces inherent truncation errors and leads to subsequent inaccurate gradient updates that accumulate over training rounds, ultimately degrading performance. T o address the above issues, we propose FedHL, a simple yet effective Federated Learning framework tailored for Heterogeneous LoRA. By leveraging the full-rank global model as a calibrated aggregation basis, FedHL eliminates the direct truncation bias from initial alignment with client-specific ranks. Furthermore, we derive the theoretically optimal aggregation weights by minimizing the gradient drift term in the convergence upper bound. Our analysis shows that FedHL guarantees O (1 / T) convergence rate, and experiments on multiple real-world datasets demonstrate a 1-3% improvement over several state-of-the-art methods. Zihao Peng, Jiandian Zeng, and Guo Li are with the Institute of Artificial Intelligence and Future Networks, Beijing Normal University, Zhuhai 519087, China (e-mail: { pzh cs, liguo }@mail.bnu.edu.cn; Boyuan Li is with the School of Computer Science and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China (e-mail: l202311841010602@gs.zzu.edu.cn). Shengbo Chen is with the School of Software, Nanchang University, Nanchang 330000, China (e-mail: ccb02kingdom@gmail.com).
Birch SGD: A Tree Graph Framework for Local and Asynchronous SGD Methods
Tyurin, Alexander, Sivtsov, Danil
We propose a new unifying framework, Birch SGD, for analyzing and designing distributed SGD methods. The central idea is to represent each method as a weighted directed tree, referred to as a computation tree. Leveraging this representation, we introduce a general theoretical result that reduces convergence analysis to studying the geometry of these trees. This perspective yields a purely graph-based interpretation of optimization dynamics, offering a new and intuitive foundation for method development. Using Birch SGD, we design eight new methods and analyze them alongside previously known ones, with at least six of the new methods shown to have optimal computational time complexity. Our research leads to two key insights: (i) all methods share the same "iteration rate" of $O\left(\frac{(R + 1) L Δ}{\varepsilon} + \frac{σ^2 L Δ}{\varepsilon^2}\right)$, where $R$ the maximum "tree distance" along the main branch of a tree; and (ii) different methods exhibit different trade-offs-for example, some update iterates more frequently, improving practical performance, while others are more communication-efficient or focus on other aspects. Birch SGD serves as a unifying framework for navigating these trade-offs. We believe these results provide a unified foundation for understanding, analyzing, and designing efficient asynchronous and parallel optimization methods.
Certified Machine Unlearning via Noisy Stochastic Gradient Descent
The right to be forgotten'' ensured by laws for user data privacy becomes increasingly important. Machine unlearning aims to efficiently remove the effect of certain data points on the trained model parameters so that it can be approximately the same as if one retrains the model from scratch. We propose to leverage projected noisy stochastic gradient descent for unlearning and establish its first approximate unlearning guarantee under the convexity assumption. Our approach exhibits several benefits, including provable complexity saving compared to retraining, and supporting sequential and batch unlearning. Both of these benefits are closely related to our new results on the infinite Wasserstein distance tracking of the adjacent (un)learning processes.
Towards Exact Gradient-based Training on Analog In-memory Computing
Given the high economic and environmental costs of using large vision or language models, analog in-memory accelerators present a promising solution for energy-efficient AI. While inference on analog accelerators has been studied recently, the training perspective is underexplored. Recent studies have shown that the "workhorse" of digital AI training - stochastic gradient descent (SGD) algorithm converges inexactly when applied to model training on non-ideal devices. This paper puts forth a theoretical foundation for gradient-based training on analog devices. We begin by characterizing the non-convergent issue of SGD, which is caused by the asymmetric updates on the analog devices.