AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Unbiased Estimation of the Gradient of the Log-Likelihood for a Class of Continuous-Time State-Space Models

arXiv.org Machine LearningMay-28-2021

In this paper, we consider static parameter estimation for a class of continuous-time state-space models. Our goal is to obtain an unbiased estimate of the gradient of the log-likelihood (score function), which is an estimate that is unbiased even if the stochastic processes involved in the model must be discretized in time. To achieve this goal, we apply a doubly randomized scheme, that involves a novel coupled conditional particle filter (CCPF) on the second level of randomization. Our novel estimate helps facilitate the application of gradient-based estimation algorithms, such as stochastic-gradient Langevin descent. We illustrate our methodology in the context of stochastic gradient descent (SGD) in several numerical examples and compare with the Rhee & Glynn estimator.

estimation, estimator, unbiased estimator, (17 more...)

arXiv.org Machine Learning

2105.11522

Country: North America > United States > New York (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.46)

Add feedback

A Primer On Gradient Descent

#artificialintelligenceMay-25-2021, 14:06:16 GMT

We spoke about how humans can easily locate where to take a step but how would numbers do this? How would numbers where the downward direction is? Above, we have a cost J which is a function of only one parameter w. This function can be represented through an equation in terms of w. To update w in order to reduce J we first start with a random value of w and then update the value using this equation. Here alpha is the learning rate we spoke about.

equation, gradient descent, primer, (1 more...)

#artificialintelligence

Genre: Overview (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.40)

Add feedback

Rank-one matrix estimation: analytic time evolution of gradient descent dynamics

Bodin, Antoine, Macris, Nicolas

arXiv.org Machine LearningMay-25-2021

We consider a rank-one symmetric matrix corrupted by additive noise. The rank-one matrix is formed by an $n$-component unknown vector on the sphere of radius $\sqrt{n}$, and we consider the problem of estimating this vector from the corrupted matrix in the high dimensional limit of $n$ large, by gradient descent for a quadratic cost function on the sphere. Explicit formulas for the whole time evolution of the overlap between the estimator and unknown vector, as well as the cost, are rigorously derived. In the long time limit we recover the well known spectral phase transition, as a function of the signal-to-noise ratio. The explicit formulas also allow to point out interesting transient features of the time evolution. Our analysis technique is based on recent progress in random matrix theory and uses local versions of the semi-circle law.

equation, gradient descent, rank-one matrix gradient descent dynamic, (9 more...)

arXiv.org Machine Learning

2105.12257

Country:

North America > United States > New York > New York County > New York City (0.04)
Europe > Switzerland > Vaud > Lausanne (0.04)
Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)
(4 more...)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.73)

Add feedback

Practical Convex Formulation of Robust One-hidden-layer Neural Network Training

Bai, Yatong, Gautam, Tanmay, Gai, Yu, Sojoudi, Somayeh

arXiv.org Machine LearningMay-25-2021

Recent work has shown that the training of a one-hidden-layer, scalar-output fully-connected ReLU neural network can be reformulated as a finite-dimensional convex program. Unfortunately, the scale of such a convex program grows exponentially in data size. In this work, we prove that a stochastic procedure with a linear complexity well approximates the exact formulation. Moreover, we derive a convex optimization approach to efficiently solve the "adversarial training" problem, which trains neural networks that are robust to adversarial input perturbations. Our method can be applied to binary classification and regression, and provides an alternative to the current adversarial training methods, such as Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD). We demonstrate in experiments that the proposed method achieves a noticeably better adversarial robustness and performance than the existing methods.

adversarial training, matrix, neural network, (16 more...)

arXiv.org Machine Learning

2105.12237

Country:

North America > Canada > Ontario > Toronto (0.14)
Asia > South Korea (0.14)
North America > United States > California > Alameda County > Berkeley (0.04)
(7 more...)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

SGD with Coordinate Sampling: Theory and Practice

Leluc, Rémi, Portier, François

arXiv.org Machine LearningMay-25-2021

While classical forms of stochastic gradient descent algorithm treat the different coordinates in the same way, a framework allowing for adaptive (non uniform) coordinate sampling is developed to leverage structure in data. In a non-convex setting and including zeroth order gradient estimate, almost sure convergence as well as non-asymptotic bounds are established. Within the proposed framework, we develop an algorithm, MUSKETEER, based on a reinforcement strategy: after collecting information on the noisy gradients, it samples the most promising coordinate (all for one); then it moves along the one direction yielding an important decrease of the objective (one for all). Numerical experiments on both synthetic and real data examples confirm the effectiveness of MUSKETEER in large scale problems.

coordinate pass, query, regression, (14 more...)

arXiv.org Machine Learning

2105.11818

Country:

North America > United States (0.04)
Europe > France > Île-de-France > Paris > Paris (0.04)

Genre: Research Report (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Data Science > Data Mining (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

Add feedback

Robust learning with anytime-guaranteed feedback

Holland, Matthew J.

arXiv.org Machine LearningMay-24-2021

Under data distributions which may be heavy-tailed, many stochastic gradient-based learning algorithms are driven by feedback queried at points with almost no performance guarantees on their own. Here we explore a modified "anytime online-to-batch" mechanism which for smooth objectives admits high-probability error bounds while requiring only lower-order moment bounds on the stochastic gradients. Using this conversion, we can derive a wide variety of "anytime robust" procedures, for which the task of performance analysis can be effectively reduced to regret control, meaning that existing regret bounds (for the bounded gradient case) can be robustified and leveraged in a straightforward manner. As a direct takeaway, we obtain an easily implemented stochastic gradient-based algorithm for which all queried points formally enjoy sub-Gaussian error bounds, and in practice show noteworthy gains on real-world data applications.

assumption, inequality, sequence, (14 more...)

arXiv.org Machine Learning

2105.11135

Country:

Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.04)
North America > Canada > Ontario > Toronto (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)

Genre:

Research Report (1.00)
Workflow (0.93)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)

Add feedback

Online Statistical Inference for Parameters Estimation with Linear-Equality Constraints

Liu, Ruiqi, Yuan, Mingao, Shang, Zuofeng

arXiv.org Machine LearningMay-21-2021

Stochastic gradient descent (SGD) and projected stochastic gradient descent (PSGD) are scalable algorithms to compute model parameters in unconstrained and constrained optimization problems. In comparison with stochastic gradient descent (SGD), PSGD forces its iterative values into the constrained parameter space via projection. The convergence rate of PSGD-type estimates has been exhaustedly studied, while statistical properties such as asymptotic distribution remain less explored. From a purely statistical point of view, this paper studies the limiting distribution of PSGD-based estimate when the true parameters satisfying some linear-equality constraints. Our theoretical findings reveal the role of projection played in the uncertainty of the PSGD estimate. As a byproduct, we propose an online hypothesis testing procedure to test the linear-equality constraints. Simulation studies on synthetic data and an application to a real-world dataset confirm our theory.

asgd estimate, conséquence, gradient descent, (15 more...)

arXiv.org Machine Learning

2105.10315

Country:

North America > United States > Texas (0.04)
North America > United States > North Dakota (0.04)
North America > United States > New Jersey (0.04)

Genre:

Research Report > New Finding (0.67)
Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Logarithmic landscape and power-law escape rate of SGD

Mori, Takashi, Ziyin, Liu, Liu, Kangqiao, Ueda, Masahito

arXiv.org Machine LearningMay-20-2021

Stochastic gradient descent (SGD) undergoes complicated multiplicative noise for the mean-square loss. We use this property of the SGD noise to derive a stochastic differential equation (SDE) with simpler additive noise by performing a non-uniform transformation of the time variable. In the SDE, the gradient of the loss is replaced by that of the logarithmized loss. Consequently, we show that, near a local or global minimum, the stationary distribution $P_\mathrm{ss}(\theta)$ of the network parameters $\theta$ follows a power-law with respect to the loss function $L(\theta)$, i.e. $P_\mathrm{ss}(\theta)\propto L(\theta)^{-\phi}$ with the exponent $\phi$ specified by the mini-batch size, the learning rate, and the Hessian at the minimum. We obtain the escape rate formula from a local minimum, which is determined not by the loss barrier height $\Delta L=L(\theta^s)-L(\theta^*)$ between a minimum $\theta^*$ and a saddle $\theta^s$ but by the logarithmized loss barrier height $\Delta\log L=\log[L(\theta^s)/L(\theta^*)]$. Our escape-rate formula explains an empirical fact that SGD prefers flat minima with low effective dimensions.

loss function, noise, sgd noise strength, (13 more...)

arXiv.org Machine Learning

2105.09557

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
Asia > Japan > Honshū > Kantō > Saitama Prefecture > Saitama (0.04)
Asia > Afghanistan > Parwan Province > Charikar (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)

Add feedback

Escaping Saddle Points with Compressed SGD

Avdiukhin, Dmitrii, Yaroslavtsev, Grigory

arXiv.org Machine LearningMay-20-2021

Stochastic Gradient Descent (SGD) and its variants are the main workhorses of modern machine learning. Distributed implementations of SGD on a cluster of machines with a central server and a large number of workers are frequently used in practice due to the massive size of the data. In distributed SGD each machine holds a copy of the model and the computation proceeds in rounds. In every round, each worker finds a stochastic gradient based on its batch of examples, the server averages these stochastic gradients to obtain the gradient of the entire batch, makes an SGD step, and broadcasts the updated model parameters to the workers. With a large number of workers, computation parallelizes efficiently while communication becomes the main bottleneck [Chilimbi et al., 2014, Strom, 2015], since each worker needs to send its gradients to the server and receive the updated model parameters. Common solutions for this problem include: local SGD and its variants, when each machine performs multiple local steps before communication [Stich, 2018]; decentralized architectures which allow pairwise communication between the workers [McMahan et al., 2017] and gradient compression, when a compressed version of the gradient is communicated instead of the full gradient [Bernstein et al., 2018, Stich et al., 2018, Karimireddy et al., 2019]. In this work, we consider the latter approach, which we refer to as compressed SGD. Most machine learning models can be described by a d-dimensional vector of parameters x and the model quality can be estimated as a function f(x).

compressor, gradient, iteration, (15 more...)

arXiv.org Machine Learning

2105.1009

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Indiana (0.04)
(2 more...)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.96)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.75)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Add feedback

Global Convergence of Three-layer Neural Networks in the Mean Field Regime

Pham, Huy Tuan, Nguyen, Phan-Minh

arXiv.org Machine LearningMay-11-2021

In the mean field regime, neural networks are appropriately scaled so that as the width tends to infinity, the learning dynamics tends to a nonlinear and nontrivial dynamical limit, known as the mean field limit. This lends a way to study large-width neural networks via analyzing the mean field limit. Recent works have successfully applied such analysis to two-layer networks and provided global convergence guarantees. The extension to multilayer ones however has been a highly challenging puzzle, and little is known about the optimization efficiency in the mean field regime when there are more than two layers. In this work, we prove a global convergence result for unregularized feedforward three-layer networks in the mean field regime. We first develop a rigorous framework to establish the mean field limit of three-layer networks under stochastic gradient descent training. To that end, we propose the idea of a \textit{neuronal embedding}, which comprises of a fixed probability space that encapsulates neural networks of arbitrary sizes. The identified mean field limit is then used to prove a global convergence guarantee under suitable regularity and convergence mode assumptions, which -- unlike previous works on two-layer networks -- does not rely critically on convexity. Underlying the result is a universal approximation property, natural of neural networks, which importantly is shown to hold at \textit{any} finite training time (not necessarily at convergence) via an algebraic topology argument.

assumption 1, conference paper, neural network, (14 more...)

arXiv.org Machine Learning

2105.05228

Country: Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.68)

Add feedback