Goto

Collaborating Authors

 Gradient Descent


Sharpness-Aware Minimization and the Edge of Stability

arXiv.org Machine Learning

Sharpness-aware Minimization (SAM) [Foret et al., 2020] is a new gradient-based neural network training algorithm that advanced the state-of-the-art test accuracy on a number of prominent benchmark datasets. As its name suggests, it explicitly seeks to find a solution that not only fits the training data, but that avoids "sharp" minima, for which nearby parameter vectors perform poorly. SAM is an incremental algorithm that updates its parameters using a gradient computed at a neighbor of the current solution. The neighbor is the point in parameter space found by taking a step of length ρ "uphill" in the gradient direction. The practical success of SAM has motivated theoretical research [Bartlett et al., 2022, Wen et al., 2023, Andriushchenko et al., 2023], including results highlighting senses in which SAM's update may be viewed, under certain conditions, as including a component that performs gradient descent on the operator norm of the Hessian [Bartlett et al., 2022, Wen et al., 2023]. Meanwhile, Cohen et al. [2021], building on the work of Jastrzebski et al. [2020] and others, exposed a striking phenomenon regarding neural network training with the original gradient descent (GD) method: for many initialization schemes and learning rates η, the operator norm of the Hessian eventually settles in the neighborhood of 2/η. This has been termed the "edge of stability", in part because a convex quadratic trained by gradient descent with a learning rate η will only Also affiliated with University of California, Berkeley.


Fast Optimal Transport through Sliced Wasserstein Generalized Geodesics

arXiv.org Machine Learning

Wasserstein distance (WD) and the associated optimal transport plan have been proven useful in many applications where probability measures are at stake. In this paper, we propose a new proxy of the squared WD, coined min-SWGG, that is based on the transport map induced by an optimal one-dimensional projection of the two input distributions. We draw connections between min-SWGG and Wasserstein generalized geodesics in which the pivot measure is supported on a line. We notably provide a new closed form for the exact Wasserstein distance in the particular case of one of the distributions supported on a line allowing us to derive a fast computational scheme that is amenable to gradient descent optimization. We show that min-SWGG is an upper bound of WD and that it has a complexity similar to as Sliced-Wasserstein, with the additional feature of providing an associated transport plan. We also investigate some theoretical properties such as metricity, weak convergence, computational and topological properties. Empirical evidences support the benefits of min-SWGG in various contexts, from gradient flows, shape matching and image colorization, among others.


Universal Gradient Descent Ascent Method for Nonconvex-Nonconcave Minimax Optimization

arXiv.org Machine Learning

Nonconvex-nonconcave minimax optimization has received intense attention over the last decade due to its broad applications in machine learning. Most existing algorithms rely on one-sided information, such as the convexity (resp. concavity) of the primal (resp. dual) functions, or other specific structures, such as the Polyak-\L{}ojasiewicz (P\L{}) and Kurdyka-\L{}ojasiewicz (K\L{}) conditions. However, verifying these regularity conditions is challenging in practice. To meet this challenge, we propose a novel universally applicable single-loop algorithm, the doubly smoothed gradient descent ascent method (DS-GDA), which naturally balances the primal and dual updates. That is, DS-GDA with the same hyperparameters is able to uniformly solve nonconvex-concave, convex-nonconcave, and nonconvex-nonconcave problems with one-sided K\L{} properties, achieving convergence with $\mathcal{O}(\epsilon^{-4})$ complexity. Sharper (even optimal) iteration complexity can be obtained when the K\L{} exponent is known. Specifically, under the one-sided K\L{} condition with exponent $\theta\in(0,1)$, DS-GDA converges with an iteration complexity of $\mathcal{O}(\epsilon^{-2\max\{2\theta,1\}})$. They all match the corresponding best results in the literature. Moreover, we show that DS-GDA is practically applicable to general nonconvex-nonconcave problems even without any regularity conditions, such as the P\L{} condition, K\L{} condition, or weak Minty variational inequalities condition. For various challenging nonconvex-nonconcave examples in the literature, including ``Forsaken'', ``Bilinearly-coupled minimax'', ``Sixth-order polynomial'', and ``PolarGame'', the proposed DS-GDA can all get rid of limit cycles. To the best of our knowledge, this is the first first-order algorithm to achieve convergence on all of these formidable problems.


DoWG Unleashed: An Efficient Universal Parameter-Free Gradient Descent Method

arXiv.org Machine Learning

We focus on gradient descent and its variants, as they are widely adopted and scale well when the model dimensionality d is large (Bottou et al., 2018). The optimization problem (OPT) finds many applications: in solving linear systems, logistic regression, support vector machines, and other areas of machine learning (Boyd and Vandenberghe, 2004). Equally important, methods designed for (stochastic) convex optimization also influence the intuition for and design of methods for nonconvex optimization-for example, momentum (Polyak, 1964), AdaGrad (Duchi et al., 2010), and Adam (Kingma and Ba, 2015) were all first analyzed in the convex optimization framework. As models become larger and more complex, the cost and environmental impact of training have rapidly grown as well (Sharir et al., 2020; Patterson et al., 2021). Therefore, it is vital that we develop more efficient and effective methods of solving machine learning optimization tasks. One of the chief challenges in applying gradient-based methods is that they often require tuning one or more stepsize parameters (Goodfellow et al., 2016), and the choice of stepsize can significantly influence a method's convergence speed as well as the quality of the obtained solutions, especially in deep learning (Wilson et al., 2017). The cost and impact of hyperparameter tuning on the optimization process have led to significant research activity in designing parameter-free and adaptive optimization methods in recent years, see e.g.


Proving Linear Mode Connectivity of Neural Networks via Optimal Transport

arXiv.org Artificial Intelligence

The energy landscape of high-dimensional non-convex optimization problems is crucial to understanding the effectiveness of modern deep neural network architectures. Recent works have experimentally shown that two different solutions found after two runs of a stochastic training are often connected by very simple continuous paths (e.g., linear) modulo a permutation of the weights. In this paper, we provide a framework theoretically explaining this empirical observation. Based on convergence rates in Wasserstein distance of empirical measures, we show that, with high probability, two wide enough two-layer neural networks trained with stochastic gradient descent are linearly connected. Additionally, we express upper and lower bounds on the width of each layer of two deep neural networks with independent neuron weights to be linearly connected. Finally, we empirically demonstrate the validity of our approach by showing how the dimension of the support of the weight distribution of neurons, which dictates Wasserstein convergence rates is correlated with linear mode connectivity.


Optimization Landscape of Policy Gradient Methods for Discrete-time Static Output Feedback

arXiv.org Artificial Intelligence

In recent times, significant advancements have been made in delving into the optimization landscape of policy gradient methods for achieving optimal control in linear time-invariant (LTI) systems. Compared with state-feedback control, output-feedback control is more prevalent since the underlying state of the system may not be fully observed in many practical settings. This paper analyzes the optimization landscape inherent to policy gradient methods when applied to static output feedback (SOF) control in discrete-time LTI systems subject to quadratic cost. We begin by establishing crucial properties of the SOF cost, encompassing coercivity, L-smoothness, and M-Lipschitz continuous Hessian. Despite the absence of convexity, we leverage these properties to derive novel findings regarding convergence (and nearly dimension-free rate) to stationary points for three policy gradient methods, including the vanilla policy gradient method, the natural policy gradient method, and the Gauss-Newton method. Moreover, we provide proof that the vanilla policy gradient method exhibits linear convergence towards local minima when initialized near such minima. The paper concludes by presenting numerical examples that validate our theoretical findings. These results not only characterize the performance of gradient descent for optimizing the SOF problem but also provide insights into the effectiveness of general policy gradient methods within the realm of reinforcement learning.


FAMO: Fast Adaptive Multitask Optimization

arXiv.org Artificial Intelligence

One of the grand enduring goals of AI is to create generalist agents that can learn multiple different tasks from diverse data via multitask learning (MTL). However, in practice, applying gradient descent (GD) on the average loss across all tasks may yield poor multitask performance due to severe under-optimization of certain tasks. Previous approaches that manipulate task gradients for a more balanced loss decrease require storing and computing all task gradients ($\mathcal{O}(k)$ space and time where $k$ is the number of tasks), limiting their use in large-scale scenarios. In this work, we introduce Fast Adaptive Multitask Optimization FAMO, a dynamic weighting method that decreases task losses in a balanced way using $\mathcal{O}(1)$ space and time. We conduct an extensive set of experiments covering multi-task supervised and reinforcement learning problems. Our results indicate that FAMO achieves comparable or superior performance to state-of-the-art gradient manipulation techniques while offering significant improvements in space and computational efficiency. Code is available at \url{https://github.com/Cranial-XIX/FAMO}.


Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data

arXiv.org Machine Learning

The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well. While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks. Therefore, implicit bias in non-smooth neural networks trained by gradient descent remains an open question. In this paper, we aim to answer this question by studying the implicit bias of gradient descent for training two-layer fully connected (leaky) ReLU neural networks. We showed that when the training data are nearly-orthogonal, for leaky ReLU activation function, gradient descent will find a network with a stable rank that converges to $1$, whereas for ReLU activation function, gradient descent will find a neural network with a stable rank that is upper bounded by a constant. Additionally, we show that gradient descent will find a neural network such that all the training data points have the same normalized margin asymptotically. Experiments on both synthetic and real data backup our theoretical findings.


Estimating the Rate-Distortion Function by Wasserstein Gradient Descent

arXiv.org Machine Learning

In the theory of lossy compression, the rate-distortion (R-D) function $R(D)$ describes how much a data source can be compressed (in bit-rate) at any given level of fidelity (distortion). Obtaining $R(D)$ for a given data source establishes the fundamental performance limit for all compression algorithms. We propose a new method to estimate $R(D)$ from the perspective of optimal transport. Unlike the classic Blahut--Arimoto algorithm which fixes the support of the reproduction distribution in advance, our Wasserstein gradient descent algorithm learns the support of the optimal reproduction distribution by moving particles. We prove its local convergence and analyze the sample complexity of our R-D estimator based on a connection to entropic optimal transport. Experimentally, we obtain comparable or tighter bounds than state-of-the-art neural network methods on low-rate sources while requiring considerably less tuning and computation effort. We also highlight a connection to maximum-likelihood deconvolution and introduce a new class of sources that can be used as test cases with known solutions to the R-D problem.


Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent

arXiv.org Machine Learning

Gaussian processes are a powerful framework for quantifying uncertainty and for sequential decision-making but are limited by the requirement of solving linear systems. In general, this has a cubic cost in dataset size and is sensitive to conditioning. We explore stochastic gradient algorithms as a computationally efficient method of approximately solving these linear systems: we develop low-variance optimization objectives for sampling from the posterior and extend these to inducing points. Counterintuitively, stochastic gradient descent often produces accurate predictions, even in cases where it does not converge quickly to the optimum. We explain this through a spectral characterization of the implicit bias from non-convergence. We show that stochastic gradient descent produces predictive distributions close to the true posterior both in regions with sufficient data coverage, and in regions sufficiently far away from the data. Experimentally, stochastic gradient descent achieves state-of-the-art performance on sufficiently large-scale or ill-conditioned regression tasks. Its uncertainty estimates match the performance of significantly more expensive baselines on a large-scale Bayesian optimization task.