Goto

Collaborating Authors

 Gradient Descent


Delay-Tolerant Algorithms for Asynchronous Distributed Online Learning

Neural Information Processing Systems

We analyze new online gradient descent algorithms for distributed systems with large delays between gradient computations and the corresponding updates. Using insights from adaptive gradient methods, we develop algorithms that adapt not only to the sequence of gradients, but also to the precise update delays that occur. We first give an impractical algorithm that achieves a regret bound that precisely quantifies the impact of the delays. We then analyze AdaptiveRevision, an algorithm that is efficiently implementable and achieves comparable guarantees. The key algorithmic technique is appropriately and efficiently revising the learning rate used for previous gradient steps. Experimental results show when the delays grow large (1000 updates or more), our new algorithms perform significantly better than standard adaptive gradient methods.


Gradient Based Method for the Fusion of Lattice Quantizers

arXiv.org Artificial Intelligence

In practical applications, lattice quantizers leverage discrete lattice points to approximate arbitrary points in the lattice. An effective lattice quantizer significantly enhances both the accuracy and efficiency of these approximations. In the context of high-dimensional lattice quantization, previous work proposed utilizing low-dimensional optimal lattice quantizers and addressed the challenge of determining the optimal length ratio in orthogonal splicing. Notably, it was demonstrated that fixed length ratios and orthogonality yield suboptimal results when combining low-dimensional lattices. Building on this foundation, another approach employed gradient descent to identify optimal lattices, which inspired us to explore the use of neural networks to discover matrices that outperform those obtained from orthogonal splicing methods. We propose two novel approaches to tackle this problem: the Household Algorithm and the Matrix Exp Algorithm. Our results indicate that both the Household Algorithm and the Matrix Exp Algorithm achieve improvements in lattice quantizers across dimensions 13, 15, 17 to 19, 21, and 22. Moreover, the Matrix Exp Algorithm demonstrates superior efficacy in high-dimensional settings.


Propagation of Chaos for Mean-Field Langevin Dynamics and its Application to Model Ensemble

arXiv.org Machine Learning

Mean-field Langevin dynamics (MFLD) is an optimization method derived by taking the mean-field limit of noisy gradient descent for two-layer neural networks in the mean-field regime. Recently, the propagation of chaos (PoC) for MFLD has gained attention as it provides a quantitative characterization of the optimization complexity in terms of the number of particles and iterations. A remarkable progress by Chen et al. (2022) showed that the approximation error due to finite particles remains uniform in time and diminishes as the number of particles increases. In this paper, by refining the defective log-Sobolev inequality -- a key result from that earlier work -- under the neural network training setting, we establish an improved PoC result for MFLD, which removes the exponential dependence on the regularization coefficient from the particle approximation term of the optimization complexity. As an application, we propose a PoC-based model ensemble strategy with theoretical guarantees.


Bayesian Sampling Using Stochastic Gradient Thermostats

Neural Information Processing Systems

Dynamics-based sampling methods, such as Hybrid Monte Carlo (HMC) and Langevin dynamics (LD), are commonly used to sample target distributions. Recently, such approaches have been combined with stochastic gradient techniques to increase sampling efficiency when dealing with large datasets. An outstanding problem with this approach is that the stochastic gradient introduces an unknown amount of noise which can prevent proper sampling after discretization. To remedy this problem, we show that one can leverage a small number of additional variables to stabilize momentum fluctuations induced by the unknown noise. Our method is inspired by the idea of a thermostat in statistical physics and is justified by a general theory.


Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning

Neural Information Processing Systems

Stochastic gradient descent algorithms for training linear and kernel predictors are gaining more and more importance, thanks to their scalability. While various methods have been proposed to speed up their convergence, the model selection phase is often ignored. In fact, in theoretical works most of the time assumptions are made, for example, on the prior knowledge of the norm of the optimal solution, while in the practical world validation methods remain the only viable approach. In this paper, we propose a new kernel-based stochastic gradient descent algorithm that performs model selection while training, with no parameters to tune, nor any form of cross-validation. The algorithm builds on recent advancement in online learning theory for unconstrained settings, to estimate over time the right regularization in a data-dependent way. Optimal rates of convergence are proved under standard smoothness assumptions on the target function as well as preliminary empirical results.


Riemannian Manifold Learning for Stackelberg Games with Neural Flow Representations

arXiv.org Artificial Intelligence

We present a novel framework for online learning in Stackelberg general-sum games, where two agents, the leader and follower, engage in sequential turn-based interactions. At the core of this approach is a learned diffeomorphism that maps the joint action space to a smooth Riemannian manifold, referred to as the Stackelberg manifold. This mapping, facilitated by neural normalizing flows, ensures the formation of tractable isoplanar subspaces, enabling efficient techniques for online learning. By assuming linearity between the agents' reward functions on the Stackelberg manifold, our construct allows the application of standard bandit algorithms. We then provide a rigorous theoretical basis for regret minimization on convex manifolds and establish finite-time bounds on simple regret for learning Stackelberg equilibria. This integration of manifold learning into game theory uncovers a previously unrecognized potential for neural normalizing flows as an effective tool for multi-agent learning. We present empirical results demonstrating the effectiveness of our approach compared to standard baselines, with applications spanning domains such as cybersecurity and economic supply chain optimization.


Parameter Tracking in Federated Learning with Adaptive Optimization

arXiv.org Artificial Intelligence

In Federated Learning (FL), model training performance is strongly impacted by data heterogeneity across clients. Gradient Tracking (GT) has recently emerged as a solution which mitigates this issue by introducing correction terms to local model updates. To date, GT has only been considered under Stochastic Gradient Descent (SGD)-based model training, while modern FL frameworks increasingly employ adaptive optimizers for improved convergence. In this work, we generalize the GT framework to a more flexible Parameter Tracking (PT) paradigm and propose two novel adaptive optimization algorithms, {\tt FAdamET} and {\tt FAdamGT}, that integrate PT into Adam-based FL. We provide a rigorous convergence analysis of these algorithms under non-convex settings. Our experimental results demonstrate that both proposed algorithms consistently outperform existing methods when evaluating total communication cost and total computation cost across varying levels of data heterogeneity, showing the effectiveness of correcting first-order information in federated adaptive optimization.


Review for NeurIPS paper: Stochastic Recursive Gradient Descent Ascent for Stochastic Nonconvex-Strongly-Concave Minimax Problems

Neural Information Processing Systems

What is the lower-bound the paper refer to? - Currently the way that the paper is written does not highlight the novelties of the work. It seems the work combines existing methods for solving min-max with exiting variance reduction modules inside. Please clarify the novelties in the analysis further (the explanation in section 5 does not highlight the difficulties and challenges).


Review for NeurIPS paper: Stochastic Recursive Gradient Descent Ascent for Stochastic Nonconvex-Strongly-Concave Minimax Problems

Neural Information Processing Systems

The reviewers all agree that the improved complexity is new for the class of stochastic nonconvex-strongly-concave minimax problems, and the claim of optimal complexity is clarified in the rebuttal. The weakness pointed out by the reviewers is the seemingly lack of novelty by combining existing techniques of SGDA and SARAH gradient estimators, but the author rebuttal is convincing that the combination in the minimax setting requires innovation. Another weakness is that the SREDA algorithm is rather complex, having multi-level loops and can be hard to tune in practice. Overall I recommend acceptance based on the value of the theoretical improvement. The authors should carefully address the remaining concerns of the reviewers in the revision, especially to clarify the claim of optimal complexity.


Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

This paper is an essentially theoretical contribution regarding convergence rates for the so-called "Hogwild"-style algorithms for stochastic gradient descent. In these algorithms, the gradient step is produces asynchronously over different chunks of the dataset in parallel, with results updating current weights as they are completed, independent of other parallel updates. Previously, demonstrating theoretical convergence has been difficult and somewhat brittle. They show that one of their proven variants, "Buckwild" provides significant real-world speedups by using lower precision arithmetic to compute the gradient steps. As far as the paper goes, it is generally good. I had little trouble reading and understanding the paper (I think), and they make a point to explain the maths in an intuitive fashion, insofar as it is possible.