AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Weak Convergence of Approximate reflection coupling and its Application to Non-convex Optimization

#artificialintelligenceMay-27-2022, 15:05:32 GMT

In this paper, we propose a weak approximation of the reflection coupling (RC) for stochastic differential equations (SDEs), and prove it converges weakly to the desired coupling. In contrast to the RC, the proposed approximate reflection coupling (ARC) need not take the hitting time of processes to the diagonal set into consideration and can be defined as the solution of some SDEs on the whole time interval. Therefore, ARC can work effectively against SDEs with different drift terms. As an application of ARC, an evaluation on the effectiveness of the stochastic gradient descent in a non-convex setting is also described. For the sample size n, the step size η, and the batch size B, we derive uniform evaluations on the time with orders n -1, η 1/2, and ((n - B) / B (n - 1)), respectively.

artificial intelligence, machine learning, reflection coupling, (7 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

Add feedback

What is "Stochastic" in Stochastic Gradient Descent (SGD)

#artificialintelligenceMay-22-2022, 14:31:39 GMT

Over the past 5 months, I had been reading the book Probability Essentials by Jean Jacod and Philip Protter, and the more time I spent on it, more I started to treat every encounter with Probability with a rigorous perspective. Recently, I was reading a paper in Deep Learning and the authors were talking about Stochastic Gradient Descent (SGD), which got me thinking, why is it called "stochastic"? Where is the randomness in it? Disclaimer: I won't be trying to explain any mathematical bits in this article solely because it is a pain to add equations. I hope the reader has some familiarity with the mathematical bits of the Gradient Descent algorithm and its variants. I'll provide a brief introduction where necessary, but won't be going into much detail.

gradient descent, iteration, loss function, (13 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Cutting Some Slack for SGD with Adaptive Polyak Stepsizes

Gower, Robert M., Blondel, Mathieu, Gazagnadou, Nidham, Pedregosa, Fabian

arXiv.org Artificial IntelligenceMay-20-2022

Tuning the step size of stochastic gradient descent is tedious and error prone. This has motivated the development of methods that automatically adapt the step size using readily available information. In this paper, we consider the family of SPS (Stochastic gradient with a Polyak Stepsize) adaptive methods. These are methods that make use of gradient and loss value at the sampled points to adaptively adjust the step size. We first show that SPS and its recent variants can all be seen as extensions of the Passive-Aggressive methods applied to nonlinear problems. We use this insight to develop new variants of the SPS method that are better suited to nonlinear models. Our new variants are based on introducing a slack variable into the interpolation equations. This single slack variable tracks the loss function across iterations and is used in setting a stable step size. We provide extensive numerical results supporting our new methods and a convergence theory.

artificial intelligence, constraint, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2202.12328

Country: Europe > France > Île-de-France > Paris > Paris (0.04)

Genre: Research Report > New Finding (0.68)

Industry: Health & Medicine > Therapeutic Area > Oncology (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Guided Policy Search using Sequential Convex Programming for Initialization of Trajectory Optimization Algorithms

Kim, Taewan, Elango, Purnanand, Malyuta, Danylo, Acikmese, Behcet

arXiv.org Artificial IntelligenceMay-19-2022

Nonlinear trajectory optimization algorithms have been developed to handle optimal control problems with nonlinear dynamics and nonconvex constraints in trajectory planning. The performance and computational efficiency of many trajectory optimization methods are sensitive to the initial guess, i.e., the trajectory guess needed by the recursive trajectory optimization algorithm. Motivated by this observation, we tackle the initialization problem for trajectory optimization via policy optimization. To optimize a policy, we propose a guided policy search method that has two key components: i) Trajectory update; ii) Policy update. The trajectory update involves offline solutions of a large number of trajectory optimization problems from different initial states via Sequential Convex Programming (SCP). Here we take a single SCP step to generate the trajectory iterate for each problem. In conjunction with these iterates, we also generate additional trajectories around each iterate via a feedback control law. Then all these trajectories are used by a stochastic gradient descent algorithm to update the neural network policy, i.e., the policy update step. As a result, the trained policy makes it possible to generate trajectory candidates that are close to the optimality and feasibility and that provide excellent initial guesses for the trajectory optimization methods. We validate the proposed method via a real-world 6-degree-of-freedom powered descent guidance problem for a reusable rocket.

artificial intelligence, machine learning, trajectory, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.23919/ACC53348.2022.9867151

2110.06975

Country:

North America > United States > Washington > King County > Seattle (0.14)
North America > United States > Hawaii (0.04)

Genre: Research Report (0.64)

Industry: Aerospace & Defense (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Add feedback

Differentially private Riemannian optimization

Han, Andi, Mishra, Bamdev, Jawanpuria, Pratik, Gao, Junbin

arXiv.org Machine LearningMay-19-2022

In this paper, we study the differentially private empirical risk minimization problem where the parameter is constrained to a Riemannian manifold. We introduce a framework of differentially private Riemannian optimization by adding noise to the Riemannian gradient on the tangent space. The noise follows a Gaussian distribution intrinsically defined with respect to the Riemannian metric. We adapt the Gaussian mechanism from the Euclidean space to the tangent space compatible to such generalized Gaussian distribution. We show that this strategy presents a simple analysis as compared to directly adding noise on the manifold. We further show privacy guarantees of the proposed differentially private Riemannian (stochastic) gradient descent using an extension of the moments accountant technique. Additionally, we prove utility guarantees under geodesic (strongly) convex, general nonconvex objectives as well as under the Riemannian Polyak-{\L}ojasiewicz condition. We show the efficacy of the proposed framework in several applications.

artificial intelligence, differentially private riemannian optimization, machine learning

arXiv.org Machine Learning

2205.09494

Genre: Research Report (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.53)

Add feedback

Heavy-Tail Phenomenon in Decentralized SGD

Gurbuzbalaban, Mert, Hu, Yuanhan, Simsekli, Umut, Yuan, Kun, Zhu, Lingjiong

arXiv.org Machine LearningMay-16-2022

Recent theoretical studies have shown that heavy-tails can emerge in stochastic optimization due to `multiplicative noise', even under surprisingly simple settings, such as linear regression with Gaussian data. While these studies have uncovered several interesting phenomena, they consider conventional stochastic optimization problems, which exclude decentralized settings that naturally arise in modern machine learning applications. In this paper, we study the emergence of heavy-tails in decentralized stochastic gradient descent (DE-SGD), and investigate the effect of decentralization on the tail behavior. We first show that, when the loss function at each computational node is twice continuously differentiable and strongly convex outside a compact region, the law of the DE-SGD iterates converges to a distribution with polynomially decaying (heavy) tails. To have a more explicit control on the tail exponent, we then consider the case where the loss at each node is a quadratic, and show that the tail-index can be estimated as a function of the step-size, batch-size, and the topological properties of the network of the computational nodes. Then, we provide theoretical and empirical results showing that DE-SGD has heavier tails than centralized SGD. We also compare DE-SGD to disconnected SGD where nodes distribute the data but do not communicate. Our theory uncovers an interesting interplay between the tails and the network structure: we identify two regimes of parameters (stepsize and network size), where DE-SGD can have lighter or heavier tails than disconnected SGD depending on the regime. Finally, to support our theoretical results, we provide numerical experiments conducted on both synthetic data and neural networks.

artificial intelligence, de-sgd, machine learning, (17 more...)

arXiv.org Machine Learning

2205.06689

Country:

North America > United States > New Jersey > Middlesex County > Piscataway (0.04)
North America > United States > New York (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(2 more...)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Add feedback

Gradient Descent Optimizes Infinite-Depth ReLU Implicit Networks with Linear Widths

Gao, Tianxiang, Gao, Hongyang

arXiv.org Machine LearningMay-16-2022

Implicit deep learning has recently become popular in the machine learning community since these implicit models can achieve competitive performance with state-of-the-art deep networks while using significantly less memory and computational resources. However, our theoretical understanding of when and how first-order methods such as gradient descent (GD) converge on \textit{nonlinear} implicit networks is limited. Although this type of problem has been studied in standard feed-forward networks, the case of implicit models is still intriguing because implicit networks have \textit{infinitely} many layers. The corresponding equilibrium equation probably admits no or multiple solutions during training. This paper studies the convergence of both gradient flow (GF) and gradient descent for nonlinear ReLU activated implicit networks. To deal with the well-posedness problem, we introduce a fixed scalar to scale the weight matrix of the implicit layer and show that there exists a small enough scaling constant, keeping the equilibrium equation well-posed throughout training. As a result, we prove that both GF and GD converge to a global minimum at a linear rate if the width $m$ of the implicit network is \textit{linear} in the sample size $N$, i.e., $m=\Omega(N)$.

artificial intelligence, machine learning, operator norm, (15 more...)

arXiv.org Machine Learning

2205.07463

Country:

North America > United States > Iowa (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.91)

Add feedback

Homogenization of SGD in high-dimensions: Exact dynamics and generalization properties

Paquette, Courtney, Paquette, Elliot, Adlam, Ben, Pennington, Jeffrey

arXiv.org Machine LearningMay-14-2022

We develop a stochastic differential equation, called homogenized SGD, for analyzing the dynamics of stochastic gradient descent (SGD) on a high-dimensional random least squares problem with $\ell^2$-regularization. We show that homogenized SGD is the high-dimensional equivalence of SGD -- for any quadratic statistic (e.g., population risk with quadratic loss), the statistic under the iterates of SGD converges to the statistic under homogenized SGD when the number of samples $n$ and number of features $d$ are polynomially related ($d^c < n < d^{1/c}$ for some $c > 0$). By analyzing homogenized SGD, we provide exact non-asymptotic high-dimensional expressions for the generalization performance of SGD in terms of a solution of a Volterra integral equation. Further we provide the exact value of the limiting excess risk in the case of quadratic losses when trained by SGD. The analysis is formulated for data matrices and target vectors that satisfy a family of resolvent conditions, which can roughly be viewed as a weak (non-quantitative) form of delocalization of sample-side singular vectors of the data. Several motivating applications are provided including sample covariance matrices with independent samples and random features with non-generative model targets.

artificial intelligence, machine learning, sgd, (18 more...)

arXiv.org Machine Learning

2205.07069

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > Canada > Quebec > Montreal (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(3 more...)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Add feedback

GRAM: Fast Fine-tuning of Pre-trained Language Models for Content-based Collaborative Filtering

Yang, Yoonseok, Kim, Kyu Seok, Kim, Minsam, Park, Juneyoung

arXiv.org Artificial IntelligenceMay-6-2022

Content-based collaborative filtering (CCF) predicts user-item interactions based on both users' interaction history and items' content information. Recently, pre-trained language models (PLM) have been used to extract high-quality item encodings for CCF. However, it is resource-intensive to train a PLM-based CCF model in an end-to-end (E2E) manner, since optimization involves back-propagating through every content encoding within a given user interaction sequence. To tackle this issue, we propose GRAM (GRadient Accumulation for Multi-modality in CCF), which exploits the fact that a given item often appears multiple times within a batch of interaction histories. Specifically, Single-step GRAM aggregates each item encoding's gradients for back-propagation, with theoretic equivalence to the standard E2E training. As an extension of Single-step GRAM, we propose Multi-step GRAM, which increases the gradient update latency, achieving a further speedup with drastically less GPU memory. GRAM significantly improves training efficiency (up to 146x) on five datasets from two task domains of Knowledge Tracing and News Recommendation. Our code is available at https://github.com/yoonseok312/GRAM.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

doi: 10.18653/v1/2022.naacl-main.61

2204.04179

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)

Genre: Research Report (0.64)

Industry: Information Technology (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.86)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.48)

Add feedback

Network Gradient Descent Algorithm for Decentralized Federated Learning

Wu, Shuyuan, Huang, Danyang, Wang, Hansheng

arXiv.org Machine LearningMay-5-2022

We study a fully decentralized federated learning algorithm, which is a novel gradient descent algorithm executed on a communication-based network. For convenience, we refer to it as a network gradient descent (NGD) method. In the NGD method, only statistics (e.g., parameter estimates) need to be communicated, minimizing the risk of privacy. Meanwhile, different clients communicate with each other directly according to a carefully designed network structure without a central master. This greatly enhances the reliability of the entire algorithm. Those nice properties inspire us to carefully study the NGD method both theoretically and numerically. Theoretically, we start with a classical linear regression model. We find that both the learning rate and the network structure play significant roles in determining the NGD estimator's statistical efficiency. The resulting NGD estimator can be statistically as efficient as the global estimator, if the learning rate is sufficiently small and the network structure is well balanced, even if the data are distributed heterogeneously. Those interesting findings are then extended to general models and loss functions. Extensive numerical studies are presented to corroborate our theoretical findings. Classical deep learning models are also presented for illustration purpose.

artificial intelligence, machine learning, network structure, (16 more...)

arXiv.org Machine Learning

2205.08364

Country:

Asia > Middle East > Jordan (0.04)
Asia > China > Beijing > Beijing (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.91)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.70)

Add feedback