AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Learning Provably Improves the Convergence of Gradient Descent

arXiv.org Artificial IntelligenceFeb-1-2025

As a specialized branch of deep learning, Learning to Optimize (L2O) tackles optimization problems by training DNN-based solvers. Despite achieving significant success in various scenarios, such as faster convergence in solving convex optimizations and improved optimality in addressing non-convex cases, there remains a deficiency in theoretical support. Current research heavily relies on stringent assumptions that do not align with the intricacies of the training process. To address this gap, our study aims to establish L2O's convergence through its training methodology. We demonstrate that learning an algorithm's hyperparameters significantly enhances its convergence. Focusing on the gradient descent (GD) algorithm for quadratic programming, we prove the convergence of L2O's training using the neural tangent kernel theory. Moreover, we conduct empirical evaluations using synthetic datasets. Our findings indicate exceeding 50\% outperformance over the GD methods.

artificial intelligence, deep learning, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2501.18092

Country:

Europe > Russia (0.04)
Asia > Russia (0.04)
Asia > China (0.04)

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.62)

Add feedback

$k$-SVD with Gradient Descent

Gan, Emily, Jedra, Yassir, Shah, Devavrat

arXiv.org Artificial IntelligenceFeb-1-2025

We show that a gradient-descent with a simple, universal rule for step-size selection provably finds $k$-SVD, i.e., the $k\geq 1$ largest singular values and corresponding vectors, of any matrix, despite nonconvexity. There has been substantial progress towards this in the past few years where existing results are able to establish such guarantees for the \emph{exact-parameterized} and \emph{over-parameterized} settings, with choice of oracle-provided step size. But guarantees for generic setting with a step size selection that does not require oracle-provided information has remained a challenge. We overcome this challenge and establish that gradient descent with an appealingly simple adaptive step size (akin to preconditioning) and random initialization enjoys global linear convergence for generic setting. Our convergence analysis reveals that the gradient method has an attracting region, and within this attracting region, the method behaves like Heron's method (a.k.a. the Babylonian method). Empirically, we validate the theoretical results. The emergence of modern compute infrastructure for iterative optimization coupled with this work is likely to provide means to solve $k$-SVD for very large matrices.

artificial intelligence, gradient descent, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2502.0032

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

SAGRAD: A Program for Neural Network Training with Simulated Annealing and the Conjugate Gradient Method

Bernal, Javier, Torres-Jimenez, Jose

arXiv.org Artificial IntelligenceJan-31-2025

SAGRAD (Simulated Annealing GRADient), a Fortran 77 program for computing neural networks for classification using batch learning, is discussed. Neural network training in SAGRAD is based on a combination of simulated annealing and M{\o}ller's scaled conjugate gradient algorithm, the latter a variation of the traditional conjugate gradient method, better suited for the nonquadratic nature of neural networks. Different aspects of the implementation of the training process in SAGRAD are discussed, such as the efficient computation of gradients and multiplication of vectors by Hessian matrices that are required by M{\o}ller's algorithm; the (re)initialization of weights with simulated annealing required to (re)start M{\o}ller's algorithm the first time and each time thereafter that it shows insufficient progress in reaching a possibly local minimum; and the use of simulated annealing when M{\o}ller's algorithm, after possibly making considerable progress, becomes stuck at a local minimum or flat area of weight space. Outlines of the scaled conjugate gradient algorithm, the simulated annealing procedure and the training process used in SAGRAD are presented together with results from running SAGRAD on two examples of training data.

algorithm, artificial intelligence, machine learning, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.6028/jres.120.009

2502.00112

Country:

North America > United States > Oklahoma > Payne County > Cushing (0.04)
North America > Mexico > Tamaulipas (0.04)
Europe > Denmark (0.04)
(3 more...)

Genre:

Workflow (0.70)
Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Add feedback

Unraveling Zeroth-Order Optimization through the Lens of Low-Dimensional Structured Perturbations

Park, Sihwan, Yun, Jihun, Kim, SungYub, Kundu, Souvik, Yang, Eunho

arXiv.org Artificial IntelligenceJan-31-2025

Zeroth-order (ZO) optimization has emerged as a promising alternative to gradient-based backpropagation methods, particularly for black-box optimization and large language model (LLM) fine-tuning. However, ZO methods suffer from slow convergence due to high-variance stochastic gradient estimators. While structured perturbations, such as sparsity and low-rank constraints, have been explored to mitigate these issues, their effectiveness remains highly under-explored. In this work, we develop a unified theoretical framework that analyzes both the convergence and generalization properties of ZO optimization under structured perturbations. We show that high dimensionality is the primary bottleneck and introduce the notions of \textit{stable rank} and \textit{effective overlap} to explain how structured perturbations reduce gradient noise and accelerate convergence. Using the uniform stability under our framework, we then provide the first theoretical justification for why these perturbations enhance generalization. Additionally, through empirical analysis, we identify that \textbf{block coordinate descent} (BCD) to be an effective structured perturbation method. Extensive experiments show that, compared to existing alternatives, memory-efficient ZO (MeZO) with BCD (\textit{MeZO-BCD}) can provide improved converge with a faster wall-clock time/iteration by up to $\times\textbf{2.09}$ while yielding similar or better accuracy.

machine learning, natural language, optimization, (18 more...)

arXiv.org Artificial Intelligence

2501.19099

Country:

Asia > South Korea > Seoul > Seoul (0.04)
North America > United States > Washington > King County > Seattle (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.66)

Add feedback

Byzantine-Resilient Zero-Order Optimization for Communication-Efficient Heterogeneous Federated Learning

Egger, Maximilian, Bakshi, Mayank, Bitar, Rawad

arXiv.org Machine LearningJan-31-2025

We introduce CyBeR-0, a Byzantine-resilient federated zero-order optimization method that is robust under Byzantine attacks and provides significant savings in uplink and downlink communication costs. We introduce transformed robust aggregation to give convergence guarantees for general non-convex objectives under client data heterogeneity. Empirical evaluations for standard learning tasks and fine-tuning large language models show that CyBeR-0 exhibits stable performance with only a few scalars per-round communication cost and reduced memory requirements.

byzantine-resilient zero-order optimization, machine learning, natural language, (15 more...)

arXiv.org Machine Learning

2502.00193

Country:

Asia > Afghanistan > Parwan Province > Charikar (0.04)
North America > United States > Arizona (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.93)

Industry: Information Technology > Security & Privacy (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.46)

Add feedback

Fantastic Multi-Task Gradient Updates and How to Find Them In a Cone

Hassanpour, Negar, Janjua, Muhammad Kamran, Zhang, Kunlin, Lavasani, Sepehr, Zhang, Xiaowen, Zhou, Chunhua, Gao, Chao

arXiv.org Artificial IntelligenceJan-31-2025

Balancing competing objectives remains a fundamental challenge in multi-task learning (MTL), primarily due to conflicting gradients across individual tasks. A common solution relies on computing a dynamic gradient update vector that balances competing tasks as optimization progresses. Building on this idea, we propose ConicGrad, a principled, scalable, and robust MTL approach formulated as a constrained optimization problem. Our method introduces an angular constraint to dynamically regulate gradient update directions, confining them within a cone centered on the reference gradient of the overall objective. By balancing task-specific gradients without over-constraining their direction or magnitude, ConicGrad effectively resolves inter-task gradient conflicts. Moreover, our framework ensures computational efficiency and scalability to high-dimensional parameter spaces. We conduct extensive experiments on standard supervised learning and reinforcement learning MTL benchmarks, and demonstrate that ConicGrad achieves state-of-the-art performance across diverse tasks.

artificial intelligence, machine learning, optimization problem, (15 more...)

arXiv.org Artificial Intelligence

2502.00217

Country:

North America > Canada (0.04)
Europe > Italy > Tuscany > Florence (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.93)

Add feedback

Regularized Langevin Dynamics for Combinatorial Optimization

Feng, Shengyu, Yang, Yiming

arXiv.org Machine LearningJan-31-2025

This work proposes a simple yet effective sampling framework for combinatorial optimization (CO). Our method builds on discrete Langevin dynamics (LD), an efficient gradient-guided generative algorithm. However, we observed that directly applying LD often leads to limited exploration. To overcome this limitation, we propose the Regularized Langevin Dynamics (RLD), which enforces an expected distance between the sampled and current solutions, effectively avoiding local minima. We develop two CO solvers on top of RLD, one based on simulated annealing (SA) and the other one based on neural network (NN). Empirical results on three classical CO problems demonstrate that both of our methods can achieve comparable or better performance against the previous state-of-the-art (SOTA) SA and NN-based solvers. In particular, our SA algorithm reduces the running time of the previous SOTA SA method by up to 80\%, while achieving equal or superior performance. In summary, RLD offers a promising framework for enhancing both traditional heuristics and NN models to solve CO problems.

graph, optimization, regularized langevin dynamic, (12 more...)

arXiv.org Machine Learning

2502.00277

Country:

North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Europe > Spain > Aragón (0.04)
Europe > France > Hauts-de-France > Nord > Lille (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)

Add feedback

In-Context Learning of Polynomial Kernel Regression in Transformers with GLU Layers

Sun, Haoyuan, Jadbabaie, Ali, Azizan, Navid

arXiv.org Artificial IntelligenceJan-30-2025

Transformer-based models have demonstrated remarkable ability in in-context learning (ICL), where they can adapt to unseen tasks from a prompt with a few examples, without requiring parameter updates. Recent research has provided insight into how linear Transformers can perform ICL by implementing gradient descent estimators. In particular, it has been shown that the optimal linear self-attention (LSA) mechanism can implement one step of gradient descent with respect to a linear least-squares objective when trained on random linear regression tasks. However, the theoretical understanding of ICL for nonlinear function classes remains limited. In this work, we address this gap by first showing that LSA is inherently restricted to solving linear least-squares objectives and thus, the solutions in prior works cannot readily extend to nonlinear ICL tasks. To overcome this limitation, drawing inspiration from modern architectures, we study a mechanism that combines LSA with GLU-like feed-forward layers and show that this allows the model to perform one step of gradient descent on a polynomial kernel regression. Further, we characterize the scaling behavior of the resulting Transformer model, highlighting the necessary model size to effectively handle quadratic ICL tasks. Our findings highlight the distinct roles of attention and feed-forward layers in nonlinear ICL and identify key challenges when extending ICL to nonlinear function classes.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2501.18187

Country: North America > United States > Massachusetts (0.04)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Add feedback

Estimating the Probability of Sampling a Trained Neural Network at Random

Scherlis, Adam, Belrose, Nora

arXiv.org Artificial IntelligenceJan-30-2025

They evaluate simple mass, under a Gaussian or uniform prior, gradient-free learning algorithms, such as the "Guess & of a region in neural network parameter space Check" optimizer which randomly samples parameters until corresponding to a particular behavior, such as it stumbles upon a network that achieves training loss achieving test loss below some threshold. When under some threshold, and find that these methods have the prior is uniform, this problem is equivalent similar generalization behavior to gradient descent, at least to measuring the volume of a region. We show on the very simple tasks they tested. Teney et al. (2024) empirically and theoretically that existing algorithms find that randomly initialized networks represent very simple for estimating volumes in parameter space functions, which would explain the simplicity bias of underestimate the true volume by millions of orders deep learning if SGD behaves similarly to Guess & Check. of magnitude. We find that this error can be dramatically reduced, but not entirely eliminated, Additionally, Mingard et al. (2021) provide evidence that with an importance sampling method using SGD may be an approximate Bayesian sampler, where the gradient information that is already provided prior distribution over functions is equal to the distribution by popular optimizers. The negative logarithm of over functions represented by randomly initialized networks.

artificial intelligence, machine learning, neighborhood, (19 more...)

arXiv.org Artificial Intelligence

2501.18812

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback

Faster Convergence of Riemannian Stochastic Gradient Descent with Increasing Batch Size

Oowada, Kanata, Iiduka, Hideaki

arXiv.org Machine LearningJan-30-2025

Many models used in machine learning have become so large that even computer computation of the full gradient of the loss function is impractical. This has made it necessary to efficiently train models using limited available information, such as batch size and learning rate. We have theoretically analyzed the use of Riemannian stochastic gradient descent (RSGD) and found that using an increasing batch size leads to faster RSGD convergence than using a constant batch size not only with a constant learning rate but also with a decaying learning rate, such as cosine annealing decay and polynomial decay. In particular, RSGD has a better convergence rate $O(\frac{1}{\sqrt{T}})$ than the existing rate $O(\frac{\sqrt{\log T}}{\sqrt[4]{T}})$ with a diminishing learning rate, where $T$ is the number of iterations. The results of experiments on principal component analysis and low-rank matrix completion problems confirmed that, except for the MovieLens dataset and a constant learning rate, using a polynomial growth batch size or an exponential growth batch size results in better performance than using a constant batch size.

artificial intelligence, gradf, machine learning, (18 more...)

arXiv.org Machine Learning

2501.18164

Country:

Asia > Middle East > Jordan (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Japan (0.04)
Asia > China (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback