AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Provable Acceleration of Heavy Ball beyond Quadratics for a Class of Polyak-\L{}ojasiewicz Functions when the Non-Convexity is Averaged-Out

Wang, Jun-Kun, Lin, Chi-Heng, Wibisono, Andre, Hu, Bin

arXiv.org Artificial IntelligenceAug-29-2023

Heavy Ball (HB) nowadays is one of the most popular momentum methods in non-convex optimization. It has been widely observed that incorporating the Heavy Ball dynamic in gradient-based methods accelerates the training process of modern machine learning models. However, the progress on establishing its theoretical foundation of acceleration is apparently far behind its empirical success. Existing provable acceleration results are of the quadratic or close-to-quadratic functions, as the current techniques of showing HB's acceleration are limited to the case when the Hessian is fixed. In this work, we develop some new techniques that help show acceleration beyond quadratics, which is achieved by analyzing how the change of the Hessian at two consecutive time points affects the convergence speed. Based on our technical results, a class of Polyak-\L{}ojasiewicz (PL) optimization problems for which provable acceleration can be achieved via HB is identified. Moreover, our analysis demonstrates a benefit of adaptively setting the momentum parameter. (Update: 08/29/2023) Erratum is added in Appendix J. This is an updated version that fixes an issue in the previous version. An additional condition needs to be satisfied for the acceleration result of HB beyond quadratics in this work, which naturally holds when the dimension is one or, more broadly, when the Hessian is diagonal. We elaborate on the issue in Appendix J.

artificial intelligence, machine learning, provable acceleration, (18 more...)

arXiv.org Artificial Intelligence

2206.11872

Country:

North America > United States > Illinois (0.04)
Asia > Middle East > Jordan (0.04)
North America > United States > Maryland > Baltimore (0.04)
(2 more...)

Genre: Research Report (0.50)

Industry: Leisure & Entertainment > Sports > Tennis (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.48)

Add feedback

AdaTerm: Adaptive T-Distribution Estimated Robust Moments for Noise-Robust Stochastic Gradient Optimization

Ilboudo, Wendyam Eric Lionel, Kobayashi, Taisuke, Matsubara, Takamitsu

arXiv.org Artificial IntelligenceAug-29-2023

With the increasing practicality of deep learning applications, practitioners are inevitably faced with datasets corrupted by noise from various sources such as measurement errors, mislabeling, and estimated surrogate inputs/outputs that can adversely impact the optimization results. It is a common practice to improve the optimization algorithm's robustness to noise, since this algorithm is ultimately in charge of updating the network parameters. Previous studies revealed that the first-order moment used in Adam-like stochastic gradient descent optimizers can be modified based on the Student's t-distribution. While this modification led to noise-resistant updates, the other associated statistics remained unchanged, resulting in inconsistencies in the assumed models. In this paper, we propose AdaTerm, a novel approach that incorporates the Student's t-distribution to derive not only the first-order moment but also all the associated statistics. This provides a unified treatment of the optimization process, offering a comprehensive framework under the statistical model of the t-distribution for the first time. The proposed approach offers several advantages over previously proposed approaches, including reduced hyperparameters and improved robustness and adaptability. This noise-adaptive behavior contributes to AdaTerm's exceptional learning performance, as demonstrated through various optimization problems with different and/or unknown noise ratios. Furthermore, we introduce a new technique for deriving a theoretical regret bound without relying on AMSGrad, providing a valuable contribution to the field

artificial intelligence, machine learning, optimization problem, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.neucom.2023.126692

2201.06714

Country: Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Large-scale gradient-based training of Mixtures of Factor Analyzers

Gepperth, Alexander

arXiv.org Artificial IntelligenceAug-26-2023

Gaussian Mixture Models (GMMs) are a standard tool in data analysis. However, they face problems when applied to high-dimensional data (e.g., images) due to the size of the required full covariance matrices (CMs), whereas the use of diagonal or spherical CMs often imposes restrictions that are too severe. The Mixture of Factor analyzers (MFA) model is an important extension of GMMs, which allows to smoothly interpolate between diagonal and full CMs based on the number of \textit{factor loadings} $l$. MFA has successfully been applied for modeling high-dimensional image data. This article contributes both a theoretical analysis as well as a new method for efficient high-dimensional MFA training by stochastic gradient descent, starting from random centroid initializations. This greatly simplifies the training and initialization process, and avoids problems of batch-type algorithms such Expectation-Maximization (EM) when training with huge amounts of data. In addition, by exploiting the properties of the matrix determinant lemma, we prove that MFA training and inference/sampling can be performed based on precision matrices, which does not require matrix inversions after training is completed. At training time, the methods requires the inversion of $l\times l$ matrices only. Besides the theoretical analysis and proofs, we apply MFA to typical image datasets such as SVHN and MNIST, and demonstrate the ability to perform sample generation and outlier detection.

loading matrix, matrix, mfa, (17 more...)

arXiv.org Artificial Intelligence

2308.13778

Country: Europe > Germany (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Add feedback

The Curse of Unrolling: Rate of Differentiating Through Optimization

Scieur, Damien, Bertrand, Quentin, Gidel, Gauthier, Pedregosa, Fabian

arXiv.org Machine LearningAug-25-2023

Computing the Jacobian of the solution of an optimization problem is a central problem in machine learning, with applications in hyperparameter optimization, meta-learning, optimization as a layer, and dataset distillation, to name a few. Unrolled differentiation is a popular heuristic that approximates the solution using an iterative solver and differentiates it through the computational path. This work provides a non-asymptotic convergence-rate analysis of this approach on quadratic objectives for gradient descent and the Chebyshev method. We show that to ensure convergence of the Jacobian, we can either 1) choose a large learning rate leading to a fast asymptotic convergence but accept that the algorithm may have an arbitrarily long burn-in phase or 2) choose a smaller learning rate leading to an immediate but slower convergence. We refer to this phenomenon as the curse of unrolling. Finally, we discuss open problems relative to this approach, such as deriving a practical update rule for the optimal unrolling strategy and making novel connections with the field of Sobolev orthogonal polynomials.

artificial intelligence, machine learning, polynomial, (16 more...)

arXiv.org Machine Learning

2209.13271

Country:

North America > Canada > Quebec > Montreal (0.04)
North America > United States > Wisconsin (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.36)

Add feedback

Learning variational autoencoders via MCMC speed measures

Hirt, Marcel, Kreouzis, Vasileios, Dellaportas, Petros

arXiv.org Artificial IntelligenceAug-25-2023

Variational autoencoders (VAEs) are popular likelihood-based generative models which can be efficiently trained by maximizing an Evidence Lower Bound (ELBO). There has been much progress in improving the expressiveness of the variational distribution to obtain tighter variational bounds and increased generative performance. Whilst previous work has leveraged Markov chain Monte Carlo (MCMC) methods for the construction of variational densities, gradient-based methods for adapting the proposal distributions for deep latent variable models have received less attention. This work suggests an entropy-based adaptation for a short-run Metropolis-adjusted Langevin (MALA) or Hamiltonian Monte Carlo (HMC) chain while optimising a tighter variational bound to the log-evidence. Experiments show that this approach yields higher held-out log-likelihoods as well as improved generative metrics. Our implicit variational density can adapt to complicated posterior geometries of latent hierarchical representations arising in hierarchical VAEs.

artificial intelligence, dataset, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2308.13731

Country:

Europe > United Kingdom > England > Greater London > London (0.04)
Europe > Greece (0.04)
Asia > Singapore (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.94)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

Network Embedding Using Sparse Approximations of Random Walks

Mercurio, Paula, Liu, Di

arXiv.org Artificial IntelligenceAug-25-2023

In this paper, we propose an efficient numerical implementation of Network Embedding based on commute times, using sparse approximation of a diffusion process on the network obtained by a modified version of the diffusion wavelet algorithm. The node embeddings are computed by optimizing the cross entropy loss via the stochastic gradient descent method with sampling of low-dimensional representations of green functions. We demonstrate the efficacy of this method for data clustering and multi-label classification through several examples, and compare its performance over existing methods in terms of efficiency and accuracy. Theoretical issues justifying the scheme are also discussed.

approximation, artificial intelligence, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2308.13663

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > Minnesota > Rice County > Northfield (0.04)
North America > United States > Michigan > Ingham County > Lansing (0.04)
(3 more...)

Genre: Research Report (0.50)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.48)

Add feedback

Min-Max Optimization under Delays

Adibi, Arman, Mitra, Aritra, Hassani, Hamed

arXiv.org Artificial IntelligenceAug-24-2023

Delays and asynchrony are inevitable in large-scale machine-learning problems where communication plays a key role. As such, several works have extensively analyzed stochastic optimization with delayed gradients. However, as far as we are aware, no analogous theory is available for min-max optimization, a topic that has gained recent popularity due to applications in adversarial robustness, game theory, and reinforcement learning. Motivated by this gap, we examine the performance of standard min-max optimization algorithms with delayed gradient updates. First, we show (empirically) that even small delays can cause prominent algorithms like Extra-gradient (\texttt{EG}) to diverge on simple instances for which \texttt{EG} guarantees convergence in the absence of delays. Our empirical study thus suggests the need for a careful analysis of delayed versions of min-max optimization algorithms. Accordingly, under suitable technical assumptions, we prove that Gradient Descent-Ascent (\texttt{GDA}) and \texttt{EG} with delayed updates continue to guarantee convergence to saddle points for convex-concave and strongly convex-strongly concave settings. Our complexity bounds reveal, in a transparent manner, the slow-down in convergence caused by delays.

artificial intelligence, machine learning, min-max optimization

arXiv.org Artificial Intelligence

2307.06886

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.53)

Add feedback

On-Manifold Projected Gradient Descent

Mahler, Aaron, Berry, Tyrus, Stephens, Tom, Antil, Harbir, Merritt, Michael, Schreiber, Jeanie, Kevrekidis, Ioannis

arXiv.org Artificial IntelligenceAug-23-2023

This work provides a computable, direct, and mathematically rigorous approximation to the differential geometry of class manifolds for high-dimensional data, along with nonlinear projections from input space onto these class manifolds. The tools are applied to the setting of neural network image classifiers, where we generate novel, on-manifold data samples, and implement a projected gradient descent algorithm for on-manifold adversarial training. The susceptibility of neural networks (NNs) to adversarial attack highlights the brittle nature of NN decision boundaries in input space. Introducing adversarial examples during training has been shown to reduce the susceptibility of NNs to adversarial attack; however, it has also been shown to reduce the accuracy of the classifier if the examples are not valid examples for that class. Realistic "on-manifold" examples have been previously generated from class manifolds in the latent of an autoencoder. Our work explores these phenomena in a geometric and computational setting that is much closer to the raw, high-dimensional input space than can be provided by VAE or other black box dimensionality reductions. We employ conformally invariant diffusion maps (CIDM) to approximate class manifolds in diffusion coordinates, and develop the Nystr\"{o}m projection to project novel points onto class manifolds in this setting. On top of the manifold approximation, we leverage the spectral exterior calculus (SEC) to determine geometric quantities such as tangent vectors of the manifold. We use these tools to obtain adversarial examples that reside on a class manifold, yet fool a classifier. These misclassifications then become explainable in terms of human-understandable manipulations within the data, by expressing the on-manifold adversary in the semantic basis on the manifold.

artificial intelligence, machine learning, manifold, (18 more...)

arXiv.org Artificial Intelligence

2308.12279

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > United States > Nevada > Clark County > Las Vegas (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(3 more...)

Genre:

Research Report (0.50)
Workflow (0.46)

Industry:

Information Technology > Security & Privacy (0.86)
Government > Military (0.86)
Automobiles & Trucks (0.68)
Transportation > Ground > Road (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.60)

Add feedback

Layer-wise Feedback Propagation

Weber, Leander, Berend, Jim, Binder, Alexander, Wiegand, Thomas, Samek, Wojciech, Lapuschkin, Sebastian

arXiv.org Artificial IntelligenceAug-23-2023

In this paper, we present Layer-wise Feedback Propagation (LFP), a novel training approach for neural-network-like predictors that utilizes explainability, specifically Layer-wise Relevance Propagation(LRP), to assign rewards to individual connections based on their respective contributions to solving a given task. This differs from traditional gradient descent, which updates parameters towards anestimated loss minimum. LFP distributes a reward signal throughout the model without the need for gradient computations. It then strengthens structures that receive positive feedback while reducingthe influence of structures that receive negative feedback. We establish the convergence of LFP theoretically and empirically, and demonstrate its effectiveness in achieving comparable performance to gradient descent on various models and datasets. Notably, LFP overcomes certain limitations associated with gradient-based methods, such as reliance on meaningful derivatives. We further investigate how the different LRP-rules can be extended to LFP, what their effects are on training, as well as potential applications, such as training models with no meaningful derivatives, e.g., step-function activated Spiking Neural Networks (SNNs), or for transfer learning, to efficiently utilize existing knowledge.

artificial intelligence, gradient descent, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2308.12053

Country:

Europe > Austria > Vienna (0.14)
North America > Canada > Ontario > Toronto (0.14)
Europe > Germany > Berlin (0.04)
(18 more...)

Genre: Research Report > New Finding (0.46)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.90)

Add feedback

Emergent segmentation from participation dynamics and multi-learner retraining

Dean, Sarah, Curmei, Mihaela, Ratliff, Lillian J., Morgenstern, Jamie, Fazel, Maryam

arXiv.org Artificial IntelligenceAug-23-2023

The choice to participate in a data-driven service, often made on the basis of quality of that service, influences the ability of the service to learn and improve. We study the participation and retraining dynamics that arise when both the learners and sub-populations of users are \emph{risk-reducing}, which cover a broad class of updates including gradient descent, multiplicative weights, etc. Suppose, for example, that individuals choose to spend their time amongst social media platforms proportionally to how well each platform works for them. Each platform also gathers data about its active users, which it uses to update parameters with a gradient step. For this example and for our general class of dynamics, we show that the only asymptotically stable equilibria are segmented, with sub-populations allocated to a single learner. Under mild assumptions, the utilitarian social optimum is a stable equilibrium. In contrast to previous work, which shows that repeated risk minimization can result in representation disparity and high overall loss for a single learner \citep{hashimoto2018fairness,miller2021outside}, we find that repeated myopic updates with multiple learners lead to better outcomes. We illustrate the phenomena via a simulated example initialized from real data.

artificial intelligence, machine learning, social media, (19 more...)

arXiv.org Artificial Intelligence

2206.02667

Country:

North America > Mexico (0.04)
Europe > Russia (0.04)
Asia > Russia (0.04)
(9 more...)

Genre: Research Report (0.64)

Industry:

Leisure & Entertainment (0.93)
Media > Music (0.68)
Information Technology > Services (0.67)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Add feedback