AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Global Convergence of Adaptive Gradient Methods for An Over-parameterized Neural Network

arXiv.org Machine LearningFeb-19-2019

Adaptive gradient methods like AdaGrad are widely used in optimizing neural networks. Yet, existing convergence guarantees for adaptive gradient methods require either convexity or smoothness, and, in the smooth setting, only guarantee convergence to a stationary point. We propose an adaptive gradient method and show that for two-layer over-parameterized neural networks -- if the width is sufficiently large (polynomially) -- then the proposed method converges \emph{to the global minimum} in polynomial time, and convergence is robust, \emph{ without the need to fine-tune hyper-parameters such as the step-size schedule and with the level of over-parametrization independent of the training error}. Our analysis indicates in particular that over-parametrization is crucial for the harnessing the full potential of adaptive gradient methods in the setting of neural networks.

condition 4, gradient descent, lemma 4, (13 more...)

arXiv.org Machine Learning

1902.07111

Country:

North America > United States > Texas > Travis County > Austin (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)
Asia > Middle East > Israel > Haifa District > Haifa (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.31)

Add feedback

Optimizing Stochastic Gradient Descent in Text Classification Based on Fine-Tuning Hyper-Parameters Approach. A Case Study on Automatic Classification of Global Terrorist Attacks

Diab, Shadi

arXiv.org Machine LearningFeb-18-2019

The objective of this research is to enhance performance of Stochastic Gradient Descent (SGD) algorithm in text classification. In our research, we proposed using SGD learning with Grid-Search approach to fine-tuning hyper-parameters in order to enhance the performance of SGD classification. We explored different settings for representation, transformation and weighting features from the summary description of terrorist attacks incidents obtained from the Global Terrorism Database as a pre-classification step, and validated SGD learning on Support Vector Machine (SVM), Logistic Regression and Perceptron classifiers by stratified 10-K-fold cross-validation to compare the performance of different classifiers embedded in SGD algorithm. The research concludes that using a grid-search to find the hyper-parameters optimize SGD classification, not in the pre-classification settings only, but also in the performance of the classifiers in terms of accuracy and execution time.

classification, classifier, stochastic gradient descent, (11 more...)

arXiv.org Machine Learning

1902.06542

Country:

North America > United States > Maryland (0.05)
Asia > Middle East > Palestine (0.05)
North America > United States > Texas > Travis County > Austin (0.04)
(7 more...)

Genre: Research Report > New Finding (0.71)

Industry: Law Enforcement & Public Safety > Terrorism (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit

Mei, Song, Misiakiewicz, Theodor, Montanari, Andrea

arXiv.org Machine LearningFeb-15-2019

We consider learning two layer neural networks using stochastic gradient descent. The mean-field description of this learning dynamics approximates the evolution of the network weights by an evolution in the space of probability distributions in $R^D$ (where $D$ is the number of parameters associated to each neuron). This evolution can be defined through a partial differential equation or, equivalently, as the gradient flow in the Wasserstein space of probability distributions. Earlier work shows that (under some regularity assumptions), the mean field description is accurate as soon as the number of hidden units is much larger than the dimension $D$. In this paper we establish stronger and more general approximation guarantees. First of all, we show that the number of hidden units only needs to be larger than a quantity dependent on the regularity properties of the data, and independent of the dimensions. Next, we generalize this analysis to the case of unbounded activation functions, which was not covered by earlier bounds. We extend our results to noisy stochastic gradient descent. Finally, we show that kernel ridge regression can be recovered as a special limit of the mean field analysis.

inequality, neural network, probability, (16 more...)

arXiv.org Machine Learning

1902.06015

Country:

North America > United States (0.14)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.94)

Add feedback

A Generalization Theory of Gradient Descent for Learning Over-parameterized Deep ReLU Networks

Cao, Yuan, Gu, Quanquan

arXiv.org Machine LearningFeb-15-2019

Empirical studies show that gradient based methods can learn deep neural networks (DNNs) with very good generalization performance in the over-parameterization regime, where DNNs can easily fit a random labeling of the training data. While a line of recent work explains in theory that gradient-based methods with proper random initialization can find the global minima of the training loss in over-parameterized DNNs, it does not explain the good generalization performance of the gradient-based methods for learning over-parameterized DNNs. In this work, we take a step further, and prove that under certain assumption on the data distribution that is milder than linear separability, gradient descent (GD) with proper random initialization is able to train a sufficiently over-parameterized DNN to achieve arbitrarily small expected error (i.e., population error). This leads to an algorithmic-dependent generalization error bound for deep learning. To the best of our knowledge, this is the first result of its kind that can explain the good generalization performance of over-parameterized deep neural networks learned by gradient descent.

arxiv preprint arxiv, neural network, theorem 5, (12 more...)

arXiv.org Machine Learning

1902.01384

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

#010A Gradient Descent for Neural Networks Master Data Science 03.09.2018

#artificialintelligenceFeb-13-2019, 19:24:14 GMT

To train parameters of our algorithm we need to perform gradient descent. When training neural network, it is important to initialize the parameters randomly rather then to all zeros. More about that you can see in this post.

artificial intelligence, machine learning, network master data science 03, (1 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.98)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.79)

Add feedback

Robust Accelerated Gradient Methods for Smooth Strongly Convex Functions

Aybat, Necdet Serhat, Fallah, Alireza, Gurbuzbalaban, Mert, Ozdaglar, Asuman

arXiv.org Machine LearningFeb-13-2019

We study the trade-offs between convergence rate and robustness to gradient errors in designing a first-order algorithm. We focus on gradient descent (GD) and accelerated gradient (AG) methods for minimizing strongly convex functions when the gradient has random errors in the form of additive white noise. With gradient errors, the function values of the iterates need not converge to the optimal value; hence, we define the robustness of an algorithm to noise as the asymptotic expected suboptimality of the iterate sequence to input noise power. For this robustness measure, we provide exact expressions for the quadratic case using tools from robust control theory and tight upper bounds for the smooth strongly convex case using Lyapunov functions certified through matrix inequalities. We use these characterizations within an optimization problem which selects parameters of each algorithm to achieve a particular trade-off between rate and robustness. Our results show that AG can achieve acceleration while being more robust to random gradient errors. This behavior is quite different than previously reported in the deterministic gradient noise setting. We also establish some connections between the robustness of an algorithm and how quickly it can converge back to the optimal solution if it is perturbed from the optimal point with deterministic noise. Our framework also leads to practical algorithms that can perform better than other state-of-the-art methods in the presence of random gradient noise.

algorithm, convergence rate, robustness, (15 more...)

arXiv.org Machine Learning

1805.10579

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
Asia > Middle East > Jordan (0.04)
North America > United States > Pennsylvania > Centre County > University Park (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

Stochastic Gradient Descent Escapes Saddle Points Efficiently

Jin, Chi, Netrapalli, Praneeth, Ge, Rong, Kakade, Sham M., Jordan, Michael I.

arXiv.org Machine LearningFeb-13-2019

This paper considers the perturbed stochastic gradient descent algorithm and shows that it finds $\epsilon$-second order stationary points ($\left\|\nabla f(x)\right\|\leq \epsilon$ and $\nabla^2 f(x) \succeq -\sqrt{\epsilon} \mathbf{I}$) in $\tilde{O}(d/\epsilon^4)$ iterations, giving the first result that has linear dependence on dimension for this setting. For the special case, where stochastic gradients are Lipschitz, the dependence on dimension reduces to polylogarithmic. In addition to giving new results, this paper also presents a simplified proof strategy that gives a shorter and more elegant proof of previously known results (Jin et al. 2017) on perturbed gradient descent algorithm.

iteration, probability, stationary point, (12 more...)

arXiv.org Machine Learning

1902.04811

Country:

Asia > Middle East > Jordan (0.05)
North America > United States > California > Alameda County > Berkeley (0.04)
North America > United States > Washington > King County > Seattle (0.04)
(2 more...)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Learning Ising Models with Independent Failures

Goel, Surbhi, Kane, Daniel M., Klivans, Adam R.

arXiv.org Machine LearningFeb-12-2019

We give the first efficient algorithm for learning the structure of an Ising model that tolerates independent failures; that is, each entry of the observed sample is missing with some unknown probability p. Our algorithm matches the essentially optimal runtime and sample complexity bounds of recent work for learning Ising models due to Klivans and Meka (2017). We devise a novel unbiased estimator for the gradient of the Interaction Screening Objective (ISO) due to Vuffray et al. (2016) and apply a stochastic multiplicative gradient descent algorithm to minimize this objective. Solutions to this minimization recover the neighborhood information of the underlying Ising model on a node by node basis.

estimator, exp, ising model, (13 more...)

arXiv.org Machine Learning

1902.04728

Country: North America > United States > Texas > Travis County > Austin (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

Towards moderate overparameterization: global convergence guarantees for training shallow neural networks

Oymak, Samet, Soltanolkotabi, Mahdi

arXiv.org Machine LearningFeb-12-2019

Many modern neural network architectures are trained in an overparameterized regime where the parameters of the model exceed the size of the training dataset. Sufficiently overparameterized neural network architectures in principle have the capacity to fit any set of labels including random noise. However, given the highly nonconvex nature of the training landscape it is not clear what level and kind of overparameterization is required for first order methods to converge to a global optima that perfectly interpolate any labels. A number of recent theoretical works have shown that for very wide neural networks where the number of hidden units is polynomially large in the size of the training data gradient descent starting from a random initialization does indeed converge to a global optima. However, in practice much more moderate levels of overparameterization seems to be sufficient and in many cases overparameterized models seem to perfectly interpolate the training data as soon as the number of parameters exceed the size of the training data by a constant factor. Thus there is a huge gap between the existing theoretical literature and practical experiments. In this paper we take a step towards closing this gap. Focusing on shallow neural nets and smooth activations, we show that (stochastic) gradient descent when initialized at random converges at a geometric rate to a nearby global optima as soon as the square-root of the number of network parameters exceeds the size of the training data. Our results also benefit from a fast convergence rate and continue to hold for non-differentiable activations such as Rectified Linear Units (ReLUs).

arxiv preprint arxiv, neural network, probability, (13 more...)

arXiv.org Machine Learning

1902.04674

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
North America > United States > California > Riverside County > Riverside (0.14)
North America > United States > Rhode Island > Providence County > Providence (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.91)

Add feedback

Cyclical Stochastic Gradient MCMC for Bayesian Deep Learning

Zhang, Ruqi, Li, Chunyuan, Zhang, Jianyi, Chen, Changyou, Wilson, Andrew Gordon

arXiv.org Machine LearningFeb-11-2019

The posteriors over neural network weights are high dimensional and multimodal. Each mode typically characterizes a meaningfully different representation of the data. We develop Cyclical Stochastic Gradient MCMC (SG-MCMC) to automatically explore such distributions. In particular, we propose a cyclical stepsize schedule, where larger steps discover new modes, and smaller steps characterize each mode. We prove that our proposed learning rate schedule provides faster convergence to samples from a stationary distribution than SG-MCMC with standard decaying schedules. Moreover, we provide extensive experimental results to demonstrate the effectiveness of cyclical SG-MCMC in learning complex multimodal distributions, especially for fully Bayesian inference with modern deep neural networks.

algorithm, cyclical stochastic gradient mcmc, sg-mcmc, (12 more...)

arXiv.org Machine Learning

1902.03932

Country:

North America > United States > New York (0.04)
Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)
Asia > Afghanistan > Parwan Province > Charikar (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.68)

Add feedback