 Gradient Descent


Conformal Symplectic Optimization for Stable Reinforcement Learning

arXiv.org Artificial Intelligence

Training deep reinforcement learning (RL) agents necessitates overcoming the highly unstable nonconvex stochastic optimization inherent in the trial-and-error mechanism. To tackle this challenge, we propose a physics-inspired optimization algorithm called relativistic adaptive gradient descent (RAD), which enhances long-term training stability. By conceptualizing neural network (NN) training as the evolution of a conformal Hamiltonian system, we present a universal framework for transferring long-term stability from conformal symplectic integrators to iterative NN updating rules, where the choice of kinetic energy governs the dynamical properties of resulting optimization algorithms. By utilizing relativistic kinetic energy, RAD incorporates principles from special relativity and limits parameter updates below a finite speed, effectively mitigating abnormal gradient influences. Additionally, RAD models NN optimization as the evolution of a multi-particle system where each trainable parameter acts as an independent particle with an individual adaptive learning rate. We prove RAD's sublinear convergence under general nonconvex settings, where smaller gradient variance and larger batch sizes contribute to tighter convergence. Notably, RAD degrades to the well-known adaptive moment estimation (ADAM) algorithm when its speed coefficient is chosen as one and symplectic factor as a small positive value. Experimental results show RAD outperforming nine baseline optimizers with five RL algorithms across twelve environments, including standard benchmarks and challenging scenarios. Notably, RAD achieves up to a 155.1% performance improvement over ADAM in Atari games, showcasing its efficacy in stabilizing and accelerating RL training.
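The speed limit that a relativistic kinetic energy imposes can be illustrated with a minimal sketch. This is not the RAD update rule from the paper, whose exact form the abstract does not give. It is a generic damped momentum step in which the velocity is derived from a relativistic kinetic energy, applied per coordinate. The names `relativistic_step`, `friction`, and `speed` are hypothetical.

```python
import numpy as np

def relativistic_step(theta, p, grad, lr=0.1, friction=0.9, speed=1.0):
    """One illustrative momentum step with a per-coordinate speed limit.

    Assumption: this is a generic relativistic-momentum sketch, not RAD itself.
    """
    # Damped momentum update; friction stands in for the dissipative term.
    p = friction * p - lr * grad
    # Relativistic velocity v = p / sqrt(1 + (p / c)^2), so |v| < c for any p.
    v = p / np.sqrt(1.0 + (p / speed) ** 2)
    return theta + lr * v, p

theta = np.zeros(3)
p = np.zeros(3)
huge_grad = np.array([1e6, -1e6, 1.0])  # two "abnormal" coordinates, one ordinary
new_theta, p = relativistic_step(theta, p, huge_grad)
max_move = np.max(np.abs(new_theta - theta))
```

Whatever the gradient magnitude, no coordinate moves by more than `lr * speed` in one step, while coordinates with small momentum behave almost like ordinary momentum descent.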


Anytime Acceleration of Gradient Descent

arXiv.org Machine Learning

This work investigates stepsize-based acceleration of gradient descent with anytime convergence guarantees. For smooth (non-strongly) convex optimization, we propose a stepsize schedule that allows gradient descent to achieve convergence guarantees of $O(T^{-1.119})$ for any stopping time $T$, where the stepsize schedule is predetermined without prior knowledge of the stopping time. This result provides an affirmative answer to a COLT open problem (Kornowski and Shamir, 2024) regarding whether stepsize-based acceleration can yield anytime convergence rates of $o(T^{-1})$. We further extend our theory to yield anytime convergence guarantees of $\exp(-\Omega(T/\kappa^{0.893}))$ for smooth and strongly convex optimization, with $\kappa$ being the condition number.
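To make "anytime" concrete, the sketch below runs gradient descent with a schedule fixed in advance and checks a guarantee at every stopping time $T$. The accelerated schedule is not given in the abstract, so the sketch uses the classical constant stepsize $1/L$, whose well-known bound $f(x_T) - f^* \le L\|x_0 - x^*\|^2 / (2T)$ is the $O(T^{-1})$ baseline the paper improves on. The quadratic test problem and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
H = A.T @ A                       # f(x) = 0.5 * x^T H x, minimized at x* = 0 with f* = 0
L = np.linalg.eigvalsh(H).max()   # smoothness constant of f

def f(x):
    return 0.5 * x @ H @ x

x0 = rng.standard_normal(5)
x = x0.copy()
gaps, bounds = [], []
for T in range(1, 201):
    x = x - (1.0 / L) * (H @ x)             # stepsize chosen without knowing T
    gaps.append(f(x))                       # suboptimality after T steps
    bounds.append(L * (x0 @ x0) / (2 * T))  # classical O(1/T) guarantee
```

Because the schedule never depends on the horizon, the bound can be checked at every prefix of the run, which is exactly the property an anytime guarantee requires.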


Memory-augmented Transformers can implement Linear First-Order Optimization Methods

arXiv.org Artificial Intelligence

We show that memory-augmented Transformers (Memformers) can implement linear first-order optimization methods such as conjugate gradient descent, momentum methods, and more generally, methods that linearly combine past gradients. Building on prior work that demonstrates how Transformers can simulate preconditioned gradient descent, we provide theoretical and empirical evidence that Memformers can learn more advanced optimization algorithms. Specifically, we analyze how memory registers in Memformers store suitable intermediate attention values allowing them to implement algorithms such as conjugate gradient. Our results show that Memformers can efficiently learn these methods by training on random linear regression tasks, even learning methods that outperform conjugate gradient. This work extends our knowledge about the algorithmic capabilities of Transformers, showing how they can learn complex optimization methods.
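For reference, the following is a minimal NumPy sketch of textbook conjugate gradient on a linear regression problem, the kind of method the abstract says Memformers can implement. Each search direction is a linear combination of the current and past gradients. This is the standard algorithm, not the Memformer construction; the data and the names `X`, `y`, and `w_cg` are illustrative assumptions.

```python
import numpy as np

def conjugate_gradient(H, b, num_steps):
    """Textbook conjugate gradient for H x = b with H symmetric positive definite."""
    x = np.zeros_like(b)
    r = b - H @ x            # residual = negative gradient of 0.5 x^T H x - b^T x
    d = r.copy()             # first direction is the steepest-descent direction
    for _ in range(num_steps):
        rr = r @ r
        if rr < 1e-30:       # already converged
            break
        Hd = H @ d
        alpha = rr / (d @ Hd)
        x = x + alpha * d
        r = r - alpha * Hd
        beta = (r @ r) / rr
        d = r + beta * d     # new direction mixes the new gradient with past ones
    return x

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)
# Least squares via the normal equations: (X^T X) w = X^T y.
w_cg = conjugate_gradient(X.T @ X, X.T @ y, num_steps=4)
```

In exact arithmetic conjugate gradient solves a $d$-dimensional system in at most $d$ steps, which is why four steps suffice here.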


Stochastic Gradient Descent Revisited

arXiv.org Machine Learning

The advent of artificial intelligence (AI) has been made possible by the spectacular growth of computing chip capacity over the last few decades, and has driven a technological revolution that has not spared any aspect of life, including healthcare, supply chain management, and social media. AI describes a set of machine learning methods that abandon any form of structural representation of data and instead seek to uncover data patterns to produce probabilistic relationships between input and output quantities of interest. While it has significantly improved people's standards of living, AI has nevertheless engendered many operational risks (e.g., by producing undesirable or unexpected outcomes) as well as systemic risks (e.g., the "Flash Crash", whereby a blue-chip company's share price suddenly plummeted and bounced back in the span of minutes [KL13]). To better manage, prevent and mitigate such risks, some level of mathematical insight must be brought in to shed light on the inner workings of AI, so that practitioners and regulators alike can act upon it to increase its efficiency and curb its shortcomings. Stochastic gradient descent (SGD) is the engine of AI, making it a natural stepping stone toward mathematically explaining AI. Indeed, to capture their intricacies, machine learning problems are often modeled using wide and highly parametrized neural networks [GBC16], which are then solved using SGD or an adaptive variant thereof, namely Adagrad, Adadelta, RMSProp, Adamax or Adam [Rud17]. To approximate a stationary point of a given loss landscape (also referred to as objective or cost function [LZB22; AL24; AMA05]), SGD recursively spawns a trajectory of iterates by factoring in, at each step, a stochastic gradient modulated by a positive learning rate. Whereas classical SGD literature provides convergence guarantees and convergence rates within a (strongly) convex framework [Duf96; BV04; RM51], machine learning models are often highly nonconvex and require new SGD frameworks to better understand and parametrize them.
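The recursion described above, $x_{k+1} = x_k - \eta_k g_k$ with $g_k$ a stochastic gradient and $\eta_k > 0$ a learning rate, can be sketched in a few lines. The noiseless least-squares problem, the decaying schedule, and the function names are assumptions chosen only to keep the example small and convex.

```python
import numpy as np

def sgd(grad_fn, x0, features, targets, lr_schedule, num_steps, rng):
    """Plain SGD: one uniformly sampled data point per step."""
    x = np.asarray(x0, dtype=float).copy()
    for k in range(num_steps):
        i = rng.integers(len(targets))           # sample an index uniformly
        g = grad_fn(x, features[i], targets[i])  # stochastic gradient g_k
        x = x - lr_schedule(k) * g               # x_{k+1} = x_k - eta_k * g_k
    return x

def sample_grad(w, a, target):
    """Gradient of 0.5 * (a.w - target)^2 with respect to w."""
    return (a @ w - target) * a

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = A @ w_true                                   # noiseless targets

w_hat = sgd(sample_grad, np.zeros(3), A, y,
            lambda k: 0.05 / (1 + 0.001 * k), 3000, rng)
```

The adaptive variants named in the text (Adagrad, RMSProp, Adam, and so on) keep the same loop and replace the scalar `lr_schedule(k)` with a per-coordinate scaling.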


Learning High-Degree Parities: The Crucial Role of the Initialization

arXiv.org Artificial Intelligence

Parities have become a standard benchmark for evaluating learning algorithms. Recent works show that regular neural networks trained by gradient descent can efficiently learn degree $k$ parities on uniform inputs for constant $k$, but fail to do so when $k$ and $d-k$ grow with $d$ (here $d$ is the ambient dimension). However, the case where $k=d-O_d(1)$ (almost-full parities), including the degree $d$ parity (the full parity), has remained unsettled. This paper shows that for gradient descent on regular neural networks, learnability depends on the initial weight distribution. On one hand, the discrete Rademacher initialization enables efficient learning of almost-full parities, while on the other hand, its Gaussian perturbation with large enough constant standard deviation $\sigma$ prevents it. The positive result for almost-full parities is shown to hold up to $\sigma=O(d^{-1})$, pointing to questions about a sharper threshold phenomenon. Unlike statistical query (SQ) learning, where a singleton function class like the full parity is trivially learnable, our negative result applies to a fixed function and relies on an initial gradient alignment measure of potential broader relevance to neural network learning.
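The objects in this abstract are easy to write down. The sketch below defines a parity target on $\pm 1$ inputs and the two initializations being contrasted. It assumes that "Gaussian perturbation" means adding independent $\mathcal{N}(0, \sigma^2)$ noise to Rademacher weights, and it omits any weight scaling the paper may use. All function names are hypothetical.

```python
import numpy as np

def parity(x, subset):
    """Parity of the coordinates in `subset` for inputs with entries in {-1, +1}."""
    return np.prod(x[:, subset], axis=1)

def rademacher_init(shape, rng):
    """Weights drawn uniformly from {-1, +1}."""
    return rng.choice([-1.0, 1.0], size=shape)

def perturbed_init(shape, sigma, rng):
    """Assumed form of the perturbation: Rademacher weights plus N(0, sigma^2) noise."""
    return rademacher_init(shape, rng) + sigma * rng.standard_normal(shape)

rng = np.random.default_rng(0)
d = 8
x = rng.choice([-1.0, 1.0], size=(5, d))   # uniform inputs on the hypercube
full = parity(x, np.arange(d))             # degree-d (full) parity
almost_full = parity(x, np.arange(d - 1))  # degree d-1 (almost-full) parity
w_rad = rademacher_init((16, d), rng)
w_pert = perturbed_init((16, d), 0.5, rng)
```

The full parity and the almost-full parity differ exactly by the sign of the omitted coordinate, so `full == almost_full * x[:, -1]` holds on every input.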


Sharper Guarantees for Learning Neural Network Classifiers with Gradient Methods

arXiv.org Machine Learning

In this paper, we study the data-dependent convergence and generalization behavior of gradient methods for neural networks with smooth activation. Our first result is a novel bound on the excess risk of deep networks trained with the logistic loss, via an algorithmic stability analysis. Compared to previous works, our results improve upon the shortcomings of the well-established Rademacher complexity-based bounds. Importantly, the bounds we derive in this paper are tighter, hold even for neural networks of small width, do not scale unfavorably with width, are algorithm-dependent, and consequently capture the role of initialization on the sample complexity of gradient descent for deep nets. Specialized to noiseless data separable with margin $\gamma$ by neural tangent kernel (NTK) features of a network of width $\Omega(\text{poly}(\log(n)))$, we show the test-error rate to be $e^{O(L)}/{\gamma^2 n}$, where $n$ is the training set size and $L$ denotes the number of hidden layers. This is an improvement in the test loss bound compared to previous works while maintaining the poly-logarithmic width conditions. We further investigate excess risk bounds for deep nets trained with noisy data, establishing that under a polynomial condition on the network width, gradient descent can achieve the optimal excess risk. Finally, we show that a large step-size significantly improves upon the NTK regime's results in classifying the XOR distribution. In particular, we show for a one-hidden-layer neural network of constant width $m$ with quadratic activation and standard Gaussian initialization that mini-batch SGD with linear sample complexity and with a large step-size $\eta=m$ reaches perfect test accuracy after only $\lceil \log(d) \rceil$ iterations, where $d$ is the data dimension.


Communication Compression for Distributed Learning without Control Variates

arXiv.org Artificial Intelligence

Distributed learning algorithms, such as the ones employed in Federated Learning (FL), require communication compression to reduce the cost of client uploads. The compression methods used in practice are often biased and require error feedback to achieve convergence when the compression is aggressive. In turn, error feedback requires client-specific control variates, which directly contradicts privacy-preserving principles and requires stateful clients. In this paper, we propose Compressed Aggregate Feedback (CAFe), a novel distributed learning framework that allows highly compressible client updates by exploiting past aggregated updates, and does not require control variates. We consider Distributed Gradient Descent (DGD) as a representative algorithm and provide a theoretical proof of CAFe's superiority to Distributed Compressed Gradient Descent (DCGD) with biased compression in the non-smooth regime with bounded gradient dissimilarity. Experimental results confirm that CAFe consistently outperforms distributed learning with direct compression and highlight the compressibility of the client updates with CAFe.
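One plausible reading of "exploiting past aggregated updates" is sketched below: each client compresses the difference between its update and the previous round's aggregate, which the server has already broadcast, and the server adds that aggregate back after averaging. This is an interpretation of the abstract, not the paper's exact algorithm. The top-k compressor and the function names are assumptions.

```python
import numpy as np

def top_k(v, k):
    """Biased compressor: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def direct_compression_round(client_updates, k):
    """Baseline: every client compresses its raw update."""
    return np.mean([top_k(u, k) for u in client_updates], axis=0)

def aggregate_feedback_round(client_updates, prev_aggregate, k):
    """Assumed CAFe-style round: compress the residual against the last aggregate."""
    residuals = [top_k(u - prev_aggregate, k) for u in client_updates]
    return prev_aggregate + np.mean(residuals, axis=0)

prev = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])    # last round's aggregate
updates = [prev + np.array([0.5, 0, 0, 0, 0, 0]),  # clients drift only slightly
           prev - np.array([0, 0, 0, 0.5, 0, 0])]
exact = np.mean(updates, axis=0)

err_direct = np.linalg.norm(direct_compression_round(updates, k=1) - exact)
err_feedback = np.linalg.norm(aggregate_feedback_round(updates, prev, k=1) - exact)
```

When consecutive updates are similar, the residuals are nearly sparse, so an aggressive compressor loses little. No per-client state is stored, because the only memory is the shared aggregate.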


Pathwise optimization for bridge-type estimators and its applications

arXiv.org Machine Learning

Sparse parametric models are of great interest in statistical learning and are often analyzed by means of regularized estimators. Pathwise methods allow efficient computation of the full solution path for penalized estimators, for any possible value of the penalization parameter $\lambda$. In this paper we deal with pathwise optimization for bridge-type problems; i.e., we are interested in the minimization of a loss function, such as the negative log-likelihood or the residual sum of squares, plus the sum of $\ell^q$ norms with $q\in(0,1]$ involving adaptive coefficients. For some loss functions this regularization asymptotically achieves the oracle properties (such as selection consistency). Nevertheless, since the objective function involves nonconvex and nondifferentiable terms, the minimization problem is computationally challenging. The aim of this paper is to apply some general algorithms, arising from nonconvex optimization theory, to efficiently compute the path solutions for the adaptive bridge estimator with multiple penalties. In particular, we take into account two different approaches: accelerated proximal gradient descent and blockwise alternating optimization. The convergence and the path consistency of these algorithms are discussed. In order to assess our methods, we apply these algorithms to the penalized estimation of diffusion processes observed at discrete times. The latter represents a recent research topic in the field of statistics for time-dependent data.
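As a point of reference, the sketch below computes a solution path by proximal gradient descent with warm starts for the boundary case $q = 1$ (an adaptive lasso on the residual sum of squares), where the proximal map is the closed-form soft-thresholding operator. It is plain rather than accelerated proximal gradient, and it does not cover $q < 1$, whose proximal map is nonconvex. The data, the uniform weights, and the function names are assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal map of t * |.| applied coordinate-wise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_grad_path(X, y, weights, lambdas, num_iters=500):
    """Minimize 0.5*||y - X b||^2 + lam * sum_j weights[j]*|b_j| for each lam."""
    step = 1.0 / np.linalg.eigvalsh(X.T @ X).max()  # 1 / Lipschitz constant
    beta = np.zeros(X.shape[1])
    path = []
    for lam in lambdas:                 # warm start: reuse beta from the previous lam
        for _ in range(num_iters):
            grad = X.T @ (X @ beta - y)
            beta = soft_threshold(beta - step * grad, step * lam * weights)
        path.append(beta.copy())
    return np.array(path)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))
y = X @ np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.0]) + 0.1 * rng.standard_normal(100)
lam_max = np.max(np.abs(X.T @ y))       # smallest lam giving the all-zero solution
lambdas = lam_max * np.array([1.0, 0.5, 0.1, 0.01])
path = prox_grad_path(X, y, np.ones(6), lambdas)
```

Sweeping $\lambda$ from large to small and reusing the previous solution is what makes the computation pathwise: each subproblem starts close to its own minimizer.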


Final-Model-Only Data Attribution with a Unifying View of Gradient-Based Methods

arXiv.org Machine Learning

Training data attribution (TDA) is the task of attributing model behavior to elements in the training data. This paper draws attention to the common setting where one has access only to the final trained model, and not the training algorithm or intermediate information from training. To serve as a gold standard for TDA in this "final-model-only" setting, we propose further training, with appropriate adjustment and averaging, to measure the sensitivity of the given model to training instances. We then unify existing gradient-based methods for TDA by showing that they all approximate the further training gold standard in different ways. We investigate empirically the quality of these gradient-based approximations to further training, for tabular, image, and text datasets and models. We find that the approximation quality of first-order methods is sometimes high but decays with the amount of further training. In contrast, the approximations given by influence function methods are more stable but surprisingly lower in quality.
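A common first-order, gradient-based attribution score is the inner product between the loss gradient at a training instance and the loss gradient at a test instance, both evaluated at the final model. The sketch below computes it for logistic regression. It illustrates this family of methods and is not necessarily one of the specific methods the paper evaluates. The function names and toy data are assumptions.

```python
import numpy as np

def logistic_grad(w, x, y):
    """Gradient of log(1 + exp(-y * w.x)) with respect to w, for y in {-1, +1}."""
    margin = y * (w @ x)
    return -y * x / (1.0 + np.exp(margin))

def grad_dot_scores(w, X_train, y_train, x_test, y_test):
    """Score each training point by the dot product of its gradient with the test gradient."""
    g_test = logistic_grad(w, x_test, y_test)
    return np.array([logistic_grad(w, x, y) @ g_test
                     for x, y in zip(X_train, y_train)])

w_final = np.zeros(2)                  # stand-in for the final trained model
X_train = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
y_train = np.array([1, 1, 1])
scores = grad_dot_scores(w_final, X_train, y_train, np.array([1.0, 0.0]), 1)
```

A positive score means that a small gradient step on that training point would also lower the test loss. An orthogonal training point scores zero.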


Limit Theorems for Stochastic Gradient Descent with Infinite Variance

arXiv.org Machine Learning

Stochastic gradient descent is a classic algorithm that has gained great popularity, especially in recent decades, as the most common approach for training models in machine learning. While the algorithm has been well studied when stochastic gradients are assumed to have finite variance, there is significantly less research addressing its theoretical properties in the case of infinite-variance gradients. In this paper, we establish the asymptotic behavior of stochastic gradient descent in the context of infinite-variance stochastic gradients, assuming that the stochastic gradient is regularly varying with index $\alpha\in(1,2)$. The closest result in this context was established in 1969, in the one-dimensional case and assuming that stochastic gradients belong to a more restrictive class of distributions. We extend it to the multidimensional case, covering a broader class of infinite-variance distributions. As we show, the asymptotic distribution of the stochastic gradient descent algorithm can be characterized as the stationary distribution of a suitably defined Ornstein-Uhlenbeck process driven by an appropriate stable L\'evy process. Additionally, we explore the applications of these results in linear regression and logistic regression models.
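A simple way to simulate the setting is to add Student-t noise with $\alpha \in (1,2)$ degrees of freedom to the gradient: such noise has a finite mean, infinite variance, and tails that are regularly varying with index $\alpha$. The sketch below does this for the quadratic $f(x) = \tfrac{1}{2}\|x\|^2$. The objective, the constant stepsize, and the function name are assumptions, and the paper's stepsize regime and scaling may differ.

```python
import numpy as np

def heavy_tailed_sgd(alpha=1.5, dim=2, num_steps=1000, lr=0.01, seed=0):
    """SGD on f(x) = 0.5*||x||^2 with infinite-variance gradient noise."""
    rng = np.random.default_rng(seed)
    x = np.ones(dim)
    traj = np.empty((num_steps, dim))
    for k in range(num_steps):
        # Student-t with df = alpha in (1, 2): mean zero, infinite variance.
        noise = rng.standard_t(df=alpha, size=dim)
        x = x - lr * (x + noise)      # the true gradient of f at x is x
        traj[k] = x
    return traj

traj = heavy_tailed_sgd()
largest_jump = np.max(np.abs(np.diff(traj, axis=0)))
```

Plotting `traj` typically shows long quiet stretches interrupted by occasional large jumps, the qualitative signature of a process driven by stable L\'evy noise rather than Brownian noise.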