AITopics | Gradient Descent

Gradient Descent

High-dimensional SGD aligns with emerging outlier eigenspaces

Arous, Gerard Ben, Gheissari, Reza, Huang, Jiaoyang, Jagannath, Aukosh

arXiv.org Machine Learning, Oct-4-2023

We rigorously study the joint evolution of training dynamics via stochastic gradient descent (SGD) and the spectra of empirical Hessian and gradient matrices. We prove that in two canonical classification tasks for multi-class high-dimensional mixtures and either 1 or 2-layer neural networks, the SGD trajectory rapidly aligns with emerging low-rank outlier eigenspaces of the Hessian and gradient matrices. Moreover, in multi-layer settings this alignment occurs per layer, with the final layer's outlier eigenspace evolving over the course of training, and exhibiting rank deficiency when the SGD converges to sub-optimal classifiers. This establishes some of the rich predictions that have arisen from extensive numerical studies in the last decade about the spectra of Hessian and information matrices over the course of training in overparametrized networks.

artificial intelligence, machine learning, matrix, (17 more...)

arXiv.org Machine Learning

2310.0301

Country:

Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.05)
North America > Canada (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(5 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

High-Probability Convergence for Composite and Distributed Stochastic Minimization and Variational Inequalities with Heavy-Tailed Noise

Gorbunov, Eduard, Sadiev, Abdurakhmon, Danilova, Marina, Horváth, Samuel, Gidel, Gauthier, Dvurechensky, Pavel, Gasnikov, Alexander, Richtárik, Peter

arXiv.org Artificial Intelligence, Oct-3-2023

High-probability analysis of stochastic first-order optimization methods under mild assumptions on the noise has been gaining a lot of attention in recent years. Typically, gradient clipping is one of the key algorithmic ingredients to derive good high-probability guarantees when the noise is heavy-tailed. However, if implemented naively, clipping can spoil the convergence of the popular methods for composite and distributed optimization (Prox-SGD/Parallel SGD) even in the absence of any noise. For this reason, many works on high-probability analysis consider only unconstrained non-distributed problems, and the existing results for composite/distributed problems do not include some important special cases (like strongly convex problems) and are not optimal. To address this issue, we propose new stochastic methods for composite and distributed optimization based on the clipping of stochastic gradient differences and prove tight high-probability convergence results (including nearly optimal ones) for the new methods. Using similar ideas, we also develop new methods for composite and distributed variational inequalities and analyze the high-probability convergence of these methods.
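For readers unfamiliar with gradient clipping, the following is a minimal sketch of the standard norm-clipping operator applied inside plain SGD. It is not the authors' method (which clips stochastic gradient differences); the quadratic objective, the Student-t noise model, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def clip(vec, lam):
    """Standard norm clipping: rescale vec so its Euclidean norm is at most lam."""
    norm = np.linalg.norm(vec)
    if norm <= lam:
        return vec
    return vec * (lam / norm)


def noisy_quadratic_grad(x, rng):
    """Gradient of 0.5 * ||x||^2 plus Student-t noise (df=2, infinite variance).

    Both the objective and the noise model are illustrative assumptions.
    """
    return x + rng.standard_t(df=2, size=x.shape)


def clipped_sgd(grad_fn, x0, step, lam, n_steps, rng):
    """Plain SGD in which each stochastic gradient is clipped before the update."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - step * clip(grad_fn(x, rng), lam)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_final = clipped_sgd(noisy_quadratic_grad, np.ones(5),
                          step=0.05, lam=1.0, n_steps=2000, rng=rng)
    print("distance to minimizer:", np.linalg.norm(x_final))
```

Because every update has norm at most `step * lam`, a single heavy-tailed sample cannot throw the iterate arbitrarily far, which is the intuition behind using clipping in high-probability analyses.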

exp, inequality, probability, (14 more...)

arXiv.org Artificial Intelligence

2310.0186

Country:

North America > Canada > Quebec > Montreal (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Russia > Central Federal District > Moscow Oblast > Moscow (0.04)
(3 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.45)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.34)

Symmetric Single Index Learning

Zweig, Aaron, Bruna, Joan

arXiv.org Artificial Intelligence, Oct-3-2023

Few neural architectures lend themselves to provable learning with gradient based methods. One popular model is the single-index model, in which labels are produced by composing an unknown linear projection with a possibly unknown scalar link function. Learning this model with SGD is relatively well-understood, whereby the so-called information exponent of the link function governs a polynomial sample complexity rate. However, extending this analysis to deeper or more complicated architectures remains challenging. In this work, we consider single index learning in the setting of symmetric neural networks. Under analytic assumptions on the activation and maximum degree assumptions on the link function, we prove that gradient flow recovers the hidden planted direction, represented as a finitely supported vector in the feature space of power sum polynomials. We characterize a notion of information exponent adapted to our setting that controls the efficiency of learning.
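As a small illustration of the classical single-index model described above (not the symmetric-network setting the paper studies), the sketch below draws labels by composing a hidden linear projection with a scalar link function. The Gaussian inputs, the function name `sample_single_index`, and the choice of link are assumptions made for the example.

```python
import numpy as np


def sample_single_index(n, w_star, link, rng):
    """Draw n samples (x, y) with y = link(<w_star, x>) and Gaussian inputs x.

    The Gaussian input distribution is an illustrative assumption.
    """
    X = rng.standard_normal((n, w_star.size))
    y = link(X @ w_star)
    return X, y


def hermite3(z):
    """Third probabilists' Hermite polynomial, He_3(z) = z^3 - 3z.

    Its first nonzero Hermite coefficient sits at degree 3.
    """
    return z**3 - 3.0 * z


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_star = np.zeros(10)
    w_star[0] = 1.0  # hidden unit-norm direction (hypothetical choice)
    X, y = sample_single_index(1000, w_star, hermite3, rng)
    print(X.shape, y.shape)
```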

assumption, gradient descent, proposition 2, (14 more...)

arXiv.org Artificial Intelligence

2310.02117

Country:

North America > United States > New York (0.04)
Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.67)

Stochastic Gradient Descent with Preconditioned Polyak Step-size

Abdukhakimov, Farshed, Xiang, Chulu, Kamzolov, Dmitry, Takáč, Martin

arXiv.org Artificial Intelligence, Oct-3-2023

Stochastic Gradient Descent (SGD) is one of the many iterative optimization methods that are widely used in solving machine learning problems. These methods display valuable properties and attract researchers and industrial machine learning engineers with their simplicity. However, one of the weaknesses of this type of method is the necessity to tune the learning rate (step-size) for every loss function and dataset combination to solve an optimization problem and get efficient performance in a given time budget. Stochastic Gradient Descent with Polyak Step-size (SPS) is a method that offers an update rule that alleviates the need for fine-tuning the learning rate of an optimizer. In this paper, we propose an extension of SPS that employs preconditioning techniques, such as Hutchinson's method, Adam, and AdaGrad, to improve its performance on badly scaled and/or ill-conditioned datasets.
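To make the Polyak step-size idea concrete, here is a minimal sketch of the basic (unpreconditioned) stochastic Polyak step, in which the step for sample i is (f_i(x) - f_i^*) / (c * ||grad f_i(x)||^2). It does not implement the preconditioned extension proposed in the paper. The least-squares problem, the assumption f_i^* = 0 (a consistent linear system), and the names `sps_step`, `c`, and `step_max` are choices made for this example.

```python
import numpy as np


def sps_step(x, loss_i, grad_i, loss_star=0.0, c=1.0, step_max=np.inf):
    """One SGD update with a Polyak-type step size for the sampled loss f_i."""
    grad_sq = float(grad_i @ grad_i)
    if grad_sq == 0.0:
        return x  # sampled loss is already stationary; nothing to do
    step = min((loss_i - loss_star) / (c * grad_sq), step_max)
    return x - step * grad_i


def run_sps_least_squares(n=50, d=5, n_steps=2000, seed=0):
    """Run the step above on f_i(x) = 0.5 * (a_i . x - b_i)^2 with b = A x_true."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    x_true = rng.standard_normal(d)
    b = A @ x_true  # consistent system, so every f_i has minimum value 0
    x = np.zeros(d)
    for _ in range(n_steps):
        i = rng.integers(n)
        r = A[i] @ x - b[i]  # residual of the sampled equation
        x = sps_step(x, 0.5 * r**2, r * A[i])
    return x, x_true


if __name__ == "__main__":
    x, x_true = run_sps_least_squares()
    print("recovery error:", np.linalg.norm(x - x_true))
```

No learning rate is supplied anywhere: the step is computed from the current loss value and gradient norm, which is the property the abstract refers to.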

loss function, stochastic gradient descent, update rule, (11 more...)

arXiv.org Artificial Intelligence

2310.02093

Country:

Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
North America > United States > New York (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Asymmetric Momentum: A Rethinking of Gradient Descent

Zhang, Gongyue, Zhang, Dinghuang, Zhao, Shuwen, Liu, Donghan, Toptan, Carrie M., Liu, Honghai

arXiv.org Artificial Intelligence, Oct-3-2023

Unlike existing adaptive methods such as Adam, which penalize frequently-changing parameters and are only applicable to sparse gradients, we propose, with theoretical and experimental validation, the simplest SGD-enhanced method, Loss-Controlled Asymmetric Momentum (LCAM). By averaging the loss, we divide the training process into different loss phases and use a different momentum in each. The method can not only accelerate slow-changing parameters for sparse gradients, similar to adaptive optimizers, but can also choose to accelerate frequently-changing parameters for non-sparse gradients, thus being adaptable to all types of datasets. We reinterpret the machine learning training process through the concepts of weight coupling and weight traction, and experimentally validate that weights have directional specificity, which is correlated with the specificity of the dataset. Interestingly, we observe that for non-sparse gradients, frequently-changing parameters should actually be accelerated, which is completely opposite to traditional adaptive perspectives. Compared to traditional SGD with momentum, this algorithm separates the weights without additional computational costs. It is noteworthy that this method relies on the network's ability to extract complex features. We primarily use Wide Residual Networks (WRN) for our research, employing the classic datasets Cifar10 and Cifar100 to test the ability for feature separation, and we report phenomena that are much more important than just accuracy rates. Finally, compared to classic SGD tuning methods, while using WRN on these two datasets and with nearly half the training epochs, we achieve equal or better test accuracy.

dataset, gradient, momentum, (16 more...)

arXiv.org Artificial Intelligence

2309.0213

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.65)

Nature Inspired Evolutionary Swarm Optimizers for Biomedical Image and Signal Processing -- A Systematic Review

Adhikary, Subhrangshu

arXiv.org Artificial Intelligence, Oct-2-2023

The challenge of finding a global optimum in a solution search space with limited resources and higher accuracy has given rise to several optimization algorithms. Generally, the gradient-based optimizers converge to the global solution very accurately, but they often require a large number of iterations to find the solution. Researchers took inspiration from different natural phenomena and behaviours of many living organisms to develop algorithms that can solve optimization problems much quicker with high accuracy. These algorithms are called nature-inspired meta-heuristic optimization algorithms. These can be used for denoising signals, updating weights in a deep neural network, and many other cases. In the state of the art, there are no systematic reviews available that have discussed the applications of nature-inspired algorithms in biomedical signal processing. The paper fills that gap by discussing the applications of such algorithms in biomedical signal processing and also provides an updated survey of the application of these algorithms in biomedical image processing. The paper reviews 28 of the latest relevant peer-reviewed articles and 26 nature-inspired algorithms and segregates them into thoroughly explored, lesser explored and unexplored categories, intending to help readers understand the reliability and exploration stage of each of these algorithms.

algorithm, application, optimization algorithm, (16 more...)

arXiv.org Artificial Intelligence

2311.1283

Country: Asia > India > West Bengal (0.04)

Genre:

Research Report (1.00)
Overview (0.93)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Diagnostic Medicine (1.00)
(2 more...)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
(3 more...)

Batch-less stochastic gradient descent for compressive learning of deep regularization for image denoising

Shi, Hui, Traonmilin, Yann, Aujol, J-F

arXiv.org Artificial Intelligence, Oct-2-2023

We consider the problem of denoising with the help of prior information taken from a database of clean signals or images. Denoising with variational methods is very efficient if a regularizer well adapted to the nature of the data is available. Thanks to the maximum a posteriori Bayesian framework, such a regularizer can be systematically linked with the distribution of the data. With deep neural networks (DNN), complex distributions can be recovered from a large training database. To reduce the computational burden of this task, we adapt the compressive learning framework to the learning of regularizers parametrized by DNN. We propose two variants of stochastic gradient descent (SGD) for the recovery of deep regularization parameters from a heavily compressed database. These algorithms outperform the initially proposed method that was limited to low-dimensional signals, each iteration using information from the whole database. They also benefit from classical SGD convergence guarantees. Thanks to these improvements we show that this method can be applied for patch based image denoising.

batch-less stochastic gradient descent, compressive learning, deep regularization

arXiv.org Artificial Intelligence

2310.03085

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Intractability of Learning the Discrete Logarithm with Gradient-Based Methods

Takhanov, Rustem, Tezekbayev, Maxat, Pak, Artur, Bolatov, Arman, Kadyrsizova, Zhibek, Assylbekov, Zhenisbek

arXiv.org Artificial Intelligence, Oct-2-2023

The discrete logarithm problem is a fundamental challenge in number theory with significant implications for cryptographic protocols. In this paper, we investigate the limitations of gradient-based methods for learning the parity bit of the discrete logarithm in finite cyclic groups of prime order. Our main result, supported by theoretical analysis and empirical verification, reveals the concentration of the gradient of the loss function around a fixed point, independent of the logarithm's base used. This concentration property leads to a restricted ability to learn the parity bit efficiently using gradient-based methods, irrespective of the complexity of the network architecture being trained. Our proof relies on the Boas-Bellman inequality in inner product spaces and involves establishing approximate orthogonality of the discrete logarithm's parity bit functions through the spectral norm of certain matrices. Empirical experiments using a neural network-based approach further verify the limitations of gradient-based learning, demonstrating the decreasing success rate in predicting the parity bit as the group order increases.

discrete logarithm, gradient-based method, parity bit, (13 more...)

arXiv.org Artificial Intelligence

2310.01611

Country:

Asia > Kazakhstan > Akmola Region > Astana (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > Indiana > Allen County > Fort Wayne (0.04)
(4 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.82)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Elephant Neural Networks: Born to Be a Continual Learner

Lan, Qingfeng, Mahmood, A. Rupam

arXiv.org Artificial Intelligence, Oct-2-2023

Catastrophic forgetting has remained a significant challenge to continual learning for decades. While recent works have proposed effective methods to mitigate this problem, they mainly focus on the algorithmic side. Meanwhile, we do not fully understand what architectural properties of neural networks lead to catastrophic forgetting. This study aims to fill this gap by studying the role of activation functions in the training dynamics of neural networks and their impact on catastrophic forgetting. Our study reveals that, besides sparse representations, the gradient sparsity of activation functions also plays an important role in reducing forgetting. Based on this insight, we propose a new class of activation functions, elephant activation functions, that can generate both sparse representations and sparse gradients. We show that by simply replacing classical activation functions with elephant activation functions, we can significantly improve the resilience of neural networks to catastrophic forgetting. Our method has broad applicability and benefits for continual learning in regression, class incremental learning, and reinforcement learning tasks. Specifically, we achieve excellent performance on the Split MNIST dataset in just one single pass, without using a replay buffer, task boundary information, or pre-training. One of the biggest challenges to achieving continual learning is the decades-old issue of catastrophic forgetting (French 1999). Catastrophic forgetting refers to the phenomenon that artificial neural networks tend to forget prior knowledge drastically when trained with stochastic gradient descent algorithms on non-independent and identically distributed (non-iid) data.

activation function, learning, neural network, (13 more...)

arXiv.org Artificial Intelligence

2310.01365

Country:

North America > Canada > Alberta (0.14)
North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

From Stability to Chaos: Analyzing Gradient Descent Dynamics in Quadratic Regression

Chen, Xuxing, Balasubramanian, Krishnakumar, Ghosal, Promit, Agrawalla, Bhavya

arXiv.org Machine Learning, Oct-2-2023

We conduct a comprehensive investigation into the dynamics of gradient descent using large-order constant step-sizes in the context of quadratic regression models. Within this framework, we reveal that the dynamics can be encapsulated by a specific cubic map, naturally parameterized by the step-size. Through a fine-grained bifurcation analysis concerning the step-size parameter, we delineate five distinct training phases: (1) monotonic, (2) catapult, (3) periodic, (4) chaotic, and (5) divergent, precisely demarcating the boundaries of each phase. As illustrations, we provide examples involving phase retrieval and two-layer neural networks employing quadratic activation functions and constant outer-layers, utilizing orthogonal training data. Our simulations indicate that these five phases also manifest with generic non-orthogonal data. We also empirically investigate the generalization performance when training in the various non-monotonic (and non-divergent) phases. In particular, we observe that performing an ergodic trajectory averaging stabilizes the test error in non-monotonic (and non-divergent) phases.
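The idea that constant-step gradient descent reduces to a cubic map parameterized by the step-size can be seen on a one-dimensional toy problem. The sketch below uses the loss (w^2 - 1)^2 / 4, which is an assumption chosen for illustration and not the specific map derived in the paper; the names `gd_cubic_map`, `trajectory`, and `blowup` are likewise hypothetical.

```python
def gd_cubic_map(w, step):
    """One GD step on the toy loss (w**2 - 1)**2 / 4.

    The gradient is w * (w**2 - 1), so the update is a cubic polynomial in w
    whose coefficients depend only on the step size.
    """
    return w - step * w * (w * w - 1.0)


def trajectory(w0, step, n_steps=200, blowup=1e6):
    """Iterate the map, stopping early once |w| exceeds the blowup threshold."""
    ws = [w0]
    for _ in range(n_steps):
        w = gd_cubic_map(ws[-1], step)
        ws.append(w)
        if abs(w) > blowup:
            break
    return ws


if __name__ == "__main__":
    # Scan a few step sizes and print the last iterates to compare regimes.
    for step in (0.1, 0.9, 1.2, 3.0):
        tail = trajectory(0.5, step)[-4:]
        print(step, [round(w, 4) for w in tail])
```

At the minimizer w = 1 the map has derivative 1 - 2 * step, so small steps contract toward the minimizer while larger ones oscillate or escape; scanning the step size is a cheap way to observe qualitatively different regimes on this toy problem.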

artificial intelligence, log loss 3, machine learning, (9 more...)

arXiv.org Machine Learning

2310.01687

Country:

North America > United States > California > Yolo County > Davis (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.86)
