AITopics

2309.12488

Country:

North America > United States > California > Alameda County > Berkeley (0.24)
North America > United States > California > Santa Clara County > Mountain View (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Mahey, Guillaume, Chapel, Laetitia, Gasso, Gilles, Bonet, Clément, Courty, Nicolas

Fast Optimal Transport through Sliced Wasserstein Generalized Geodesics

arXiv.org Machine LearningOct-30-2023

Wasserstein distance (WD) and the associated optimal transport plan have been proven useful in many applications where probability measures are at stake. In this paper, we propose a new proxy of the squared WD, coined min-SWGG, that is based on the transport map induced by an optimal one-dimensional projection of the two input distributions. We draw connections between min-SWGG and Wasserstein generalized geodesics in which the pivot measure is supported on a line. We notably provide a new closed form for the exact Wasserstein distance in the particular case of one of the distributions supported on a line allowing us to derive a fast computational scheme that is amenable to gradient descent optimization. We show that min-SWGG is an upper bound of WD and that it has a complexity similar to as Sliced-Wasserstein, with the additional feature of providing an associated transport plan. We also investigate some theoretical properties such as metricity, weak convergence, computational and topological properties. Empirical evidences support the benefits of min-SWGG in various contexts, from gradient flows, shape matching and image colorization, among others.

artificial intelligence, machine learning, optimization problem, (20 more...)

2307.0177

Country:

Europe > France > Normandy > Seine-Maritime > Rouen (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Industry: Energy (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

arXiv.org Machine LearningOct-30-2023

Universal Gradient Descent Ascent Method for Nonconvex-Nonconcave Minimax Optimization

Zheng, Taoli, Zhu, Linglingzhi, So, Anthony Man-Cho, Blanchet, Jose, Li, Jiajin

Nonconvex-nonconcave minimax optimization has received intense attention over the last decade due to its broad applications in machine learning. Most existing algorithms rely on one-sided information, such as the convexity (resp. concavity) of the primal (resp. dual) functions, or other specific structures, such as the Polyak-\L{}ojasiewicz (P\L{}) and Kurdyka-\L{}ojasiewicz (K\L{}) conditions. However, verifying these regularity conditions is challenging in practice. To meet this challenge, we propose a novel universally applicable single-loop algorithm, the doubly smoothed gradient descent ascent method (DS-GDA), which naturally balances the primal and dual updates. That is, DS-GDA with the same hyperparameters is able to uniformly solve nonconvex-concave, convex-nonconcave, and nonconvex-nonconcave problems with one-sided K\L{} properties, achieving convergence with $\mathcal{O}(\epsilon^{-4})$ complexity. Sharper (even optimal) iteration complexity can be obtained when the K\L{} exponent is known. Specifically, under the one-sided K\L{} condition with exponent $\theta\in(0,1)$, DS-GDA converges with an iteration complexity of $\mathcal{O}(\epsilon^{-2\max\{2\theta,1\}})$. They all match the corresponding best results in the literature. Moreover, we show that DS-GDA is practically applicable to general nonconvex-nonconcave problems even without any regularity conditions, such as the P\L{} condition, K\L{} condition, or weak Minty variational inequalities condition. For various challenging nonconvex-nonconcave examples in the literature, including ``Forsaken'', ``Bilinearly-coupled minimax'', ``Sixth-order polynomial'', and ``PolarGame'', the proposed DS-GDA can all get rid of limit cycles. To the best of our knowledge, this is the first first-order algorithm to achieve convergence on all of these formidable problems.

artificial intelligence, inequality, machine learning, (15 more...)

2212.12978

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)
(2 more...)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.82)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.61)

Khaled, Ahmed, Mishchenko, Konstantin, Jin, Chi

DoWG Unleashed: An Efficient Universal Parameter-Free Gradient Descent Method

We focus on gradient descent and its variants, as they are widely adopted and scale well when the model dimensionality d is large (Bottou et al., 2018). The optimization problem (OPT) finds many applications: in solving linear systems, logistic regression, support vector machines, and other areas of machine learning (Boyd and Vandenberghe, 2004). Equally important, methods designed for (stochastic) convex optimization also influence the intuition for and design of methods for nonconvex optimization-for example, momentum (Polyak, 1964), AdaGrad (Duchi et al., 2010), and Adam (Kingma and Ba, 2015) were all first analyzed in the convex optimization framework. As models become larger and more complex, the cost and environmental impact of training have rapidly grown as well (Sharir et al., 2020; Patterson et al., 2021). Therefore, it is vital that we develop more efficient and effective methods of solving machine learning optimization tasks. One of the chief challenges in applying gradient-based methods is that they often require tuning one or more stepsize parameters (Goodfellow et al., 2016), and the choice of stepsize can significantly influence a method's convergence speed as well as the quality of the obtained solutions, especially in deep learning (Wilson et al., 2017). The cost and impact of hyperparameter tuning on the optimization process have led to significant research activity in designing parameter-free and adaptive optimization methods in recent years, see e.g.

algorithm, gradient descent, stepsize, (13 more...)

2305.16284

Country:

North America > United States > California > Los Angeles County > Long Beach (0.14)
North America > United States > California > San Diego County > San Diego (0.04)
North America > Canada > Quebec > Montreal (0.04)
(10 more...)

Genre: Research Report (0.91)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Ferbach, Damien, Goujaud, Baptiste, Gidel, Gauthier, Dieuleveut, Aymeric

Proving Linear Mode Connectivity of Neural Networks via Optimal Transport

arXiv.org Artificial IntelligenceOct-29-2023

The energy landscape of high-dimensional non-convex optimization problems is crucial to understanding the effectiveness of modern deep neural network architectures. Recent works have experimentally shown that two different solutions found after two runs of a stochastic training are often connected by very simple continuous paths (e.g., linear) modulo a permutation of the weights. In this paper, we provide a framework theoretically explaining this empirical observation. Based on convergence rates in Wasserstein distance of empirical measures, we show that, with high probability, two wide enough two-layer neural networks trained with stochastic gradient descent are linearly connected. Additionally, we express upper and lower bounds on the width of each layer of two deep neural networks with independent neuron weights to be linearly connected. Finally, we empirically demonstrate the validity of our approach by showing how the dimension of the support of the weight distribution of neurons, which dictates Wasserstein convergence rates is correlated with linear mode connectivity.

lemma, linear mode connectivity, neural network, (12 more...)

arXiv.org Artificial Intelligence

2310.19103

Country:

North America > United States (0.14)
Europe > Latvia > Lubāna Municipality > Lubāna (0.04)
North America > Canada > Quebec > Montreal (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)

arXiv.org Artificial IntelligenceOct-29-2023

Optimization Landscape of Policy Gradient Methods for Discrete-time Static Output Feedback

Duan, Jingliang, Li, Jie, Chen, Xuyang, Zhao, Kai, Li, Shengbo Eben, Zhao, Lin

In recent times, significant advancements have been made in delving into the optimization landscape of policy gradient methods for achieving optimal control in linear time-invariant (LTI) systems. Compared with state-feedback control, output-feedback control is more prevalent since the underlying state of the system may not be fully observed in many practical settings. This paper analyzes the optimization landscape inherent to policy gradient methods when applied to static output feedback (SOF) control in discrete-time LTI systems subject to quadratic cost. We begin by establishing crucial properties of the SOF cost, encompassing coercivity, L-smoothness, and M-Lipschitz continuous Hessian. Despite the absence of convexity, we leverage these properties to derive novel findings regarding convergence (and nearly dimension-free rate) to stationary points for three policy gradient methods, including the vanilla policy gradient method, the natural policy gradient method, and the Gauss-Newton method. Moreover, we provide proof that the vanilla policy gradient method exhibits linear convergence towards local minima when initialized near such minima. The paper concludes by presenting numerical examples that validate our theoretical findings. These results not only characterize the performance of gradient descent for optimizing the SOF problem but also provide insights into the effectiveness of general policy gradient methods within the realm of reinforcement learning.

convergence, gradient method, policy gradient method, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/TCYB.2023.3323316

2310.19022

Country:

Asia > China > Beijing > Beijing (0.05)
Asia > Singapore > Central Region > Singapore (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
(9 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Control Systems (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

arXiv.org Artificial IntelligenceOct-29-2023

FAMO: Fast Adaptive Multitask Optimization

Liu, Bo, Feng, Yihao, Stone, Peter, Liu, Qiang

One of the grand enduring goals of AI is to create generalist agents that can learn multiple different tasks from diverse data via multitask learning (MTL). However, in practice, applying gradient descent (GD) on the average loss across all tasks may yield poor multitask performance due to severe under-optimization of certain tasks. Previous approaches that manipulate task gradients for a more balanced loss decrease require storing and computing all task gradients ($\mathcal{O}(k)$ space and time where $k$ is the number of tasks), limiting their use in large-scale scenarios. In this work, we introduce Fast Adaptive Multitask Optimization FAMO, a dynamic weighting method that decreases task losses in a balanced way using $\mathcal{O}(1)$ space and time. We conduct an extensive set of experiments covering multi-task supervised and reinforcement learning problems. Our results indicate that FAMO achieves comparable or superior performance to state-of-the-art gradient manipulation techniques while offering significant improvements in space and computational efficiency. Code is available at \url{https://github.com/Cranial-XIX/FAMO}.

famo, gradient, task loss, (10 more...)

arXiv.org Artificial Intelligence

2306.03792

Country: North America > United States > Texas > Travis County > Austin (0.04)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU Networks on Nearly-orthogonal Data

Kou, Yiwen, Chen, Zixiang, Gu, Quanquan

The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well. While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks. Therefore, implicit bias in non-smooth neural networks trained by gradient descent remains an open question. In this paper, we aim to answer this question by studying the implicit bias of gradient descent for training two-layer fully connected (leaky) ReLU neural networks. We showed that when the training data are nearly-orthogonal, for leaky ReLU activation function, gradient descent will find a network with a stable rank that converges to $1$, whereas for ReLU activation function, gradient descent will find a neural network with a stable rank that is upper bounded by a constant. Additionally, we show that gradient descent will find a neural network such that all the training data points have the same normalized margin asymptotically. Experiments on both synthetic and real data backup our theoretical findings.

artificial intelligence, inequality, machine learning, (17 more...)

2310.18935

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.27)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Estimating the Rate-Distortion Function by Wasserstein Gradient Descent

Yang, Yibo, Eckstein, Stephan, Nutz, Marcel, Mandt, Stephan

In the theory of lossy compression, the rate-distortion (R-D) function $R(D)$ describes how much a data source can be compressed (in bit-rate) at any given level of fidelity (distortion). Obtaining $R(D)$ for a given data source establishes the fundamental performance limit for all compression algorithms. We propose a new method to estimate $R(D)$ from the perspective of optimal transport. Unlike the classic Blahut--Arimoto algorithm which fixes the support of the reproduction distribution in advance, our Wasserstein gradient descent algorithm learns the support of the optimal reproduction distribution by moving particles. We prove its local convergence and analyze the sample complexity of our R-D estimator based on a connection to entropic optimal transport. Experimentally, we obtain comparable or tighter bounds than state-of-the-art neural network methods on low-rate sources while requiring considerably less tuning and computation effort. We also highlight a connection to maximum-likelihood deconvolution and introduce a new class of sources that can be used as test cases with known solutions to the R-D problem.

artificial intelligence, bayesian inference, machine learning, (18 more...)

2310.18908

Country:

Asia > Middle East > Jordan (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(2 more...)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)

Lin, Jihao Andreas, Antorán, Javier, Padhy, Shreyas, Janz, David, Hernández-Lobato, José Miguel, Terenin, Alexander

Sampling from Gaussian Process Posteriors using Stochastic Gradient Descent

Gaussian processes are a powerful framework for quantifying uncertainty and for sequential decision-making but are limited by the requirement of solving linear systems. In general, this has a cubic cost in dataset size and is sensitive to conditioning. We explore stochastic gradient algorithms as a computationally efficient method of approximately solving these linear systems: we develop low-variance optimization objectives for sampling from the posterior and extend these to inducing points. Counterintuitively, stochastic gradient descent often produces accurate predictions, even in cases where it does not converge quickly to the optimum. We explain this through a spectral characterization of the implicit bias from non-convergence. We show that stochastic gradient descent produces predictive distributions close to the true posterior both in regions with sufficient data coverage, and in regions sufficiently far away from the data. Experimentally, stochastic gradient descent achieves state-of-the-art performance on sufficiently large-scale or ill-conditioned regression tasks. Its uncertainty estimates match the performance of significantly more expensive baselines on a large-scale Bayesian optimization task.

artificial intelligence, gaussian process, machine learning, (15 more...)

2306.11589

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
North America > United States (0.14)
North America > Canada > Alberta (0.14)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)