AITopics | Gradient Descent

Collaborating Authors

Gradient Descent

News Overviews Instructional Materials AI-Alerts Classics

Faster SGD training by minibatch persistency

Fischetti, Matteo, Mandatelli, Iacopo, Salvagnin, Domenico

arXiv.org Machine LearningJun-19-2018

It is well known that, for most datasets, the use of large-size minibatches for Stochastic Gradient Descent (SGD) typically leads to slow convergence and poor generalization. On the other hand, large minibatches are of great practical interest as they allow for a better exploitation of modern GPUs. Previous literature on the subject concentrated on how to adjust the main SGD parameters (in particular, the learning rate) when using large minibatches. In this work we introduce an additional feature, that we call minibatch persistency, that consists in reusing the same minibatch for K consecutive SGD iterations. The computational conjecture here is that a large minibatch contains a significant sample of the training set, so one can afford to slightly overfitting it without worsening generalization too much. The approach is intended to speedup SGD convergence, and also has the advantage of reducing the overhead related to data loading on the internal GPU memory. We present computational results on CIFAR-10 with an AlexNet architecture, showing that even small persistency values (K=2 or 5) already lead to a significantly faster convergence and to a comparable (or even better) generalization than the standard "disposable minibatch" approach (K=1), in particular when large minibatches are used. The lesson learned is that minibatch persistency can be a simple yet effective way to deal with large minibatches.

artificial intelligence, machine learning, minibatch, (15 more...)

arXiv.org Machine Learning

1806.07353

Country:

Europe > Italy (0.05)
North America > United States > California > Los Angeles County > Long Beach (0.04)
North America > Canada > Ontario > Toronto (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.56)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

Inference in Deep Gaussian Processes using Stochastic Gradient Hamiltonian Monte Carlo

Havasi, Marton, Hernández-Lobato, José Miguel, Murillo-Fuentes, Juan José

arXiv.org Machine LearningJun-18-2018

Deep Gaussian Processes (DGPs) are hierarchical generalizations of Gaussian Processes that combine well calibrated uncertainty estimates with the high flexibility of multilayer models. One of the biggest challenges with these models is that exact inference is intractable. The current state-of-the-art inference method, Variational Inference (VI), employs a Gaussian approximation to the posterior distribution. This can be a potentially poor unimodal approximation of the generally multimodal posterior. In this work, we provide evidence for the non-Gaussian nature of the posterior and we apply the Stochastic Gradient Hamiltonian Monte Carlo method to directly sample from it. To efficiently optimize the hyperparameters, we introduce the Moving Window MCEM algorithm. This results in significantly better predictions at a lower computational cost than its VI counterpart. Thus our method establishes a new state-of-the-art for inference in DGPs.

artificial intelligence, dataset, machine learning, (15 more...)

arXiv.org Machine Learning

1806.0549

Genre: Research Report (0.82)

Industry: Energy (0.47)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.62)

Add feedback

Laplacian Smoothing Gradient Descent

Osher, Stanley, Wang, Bao, Yin, Penghang, Luo, Xiyang, Pham, Minh, Lin, Alex

arXiv.org Machine LearningJun-16-2018

We propose a very simple modification of gradient descent and stochastic gradient descent. We show that when applied to a variety of machine learning models including softmax regression, convolutional neural nets, generative adversarial nets, and deep reinforcement learning, this very simple surrogate can dramatically reduce the variance and improve the accuracy of the generalization. The new algorithm, (which depends on one nonnegative parameter) when applied to non-convex minimization, tends to avoid sharp local minima. Instead it seeks somewhat flatter local (and often global) minima. The method only involves preconditioning the gradient by the inverse of a tri-diagonal matrix that is positive definite. The motivation comes from the theory of Hamilton-Jacobi partial differential equations. This theory demonstrates that the new algorithm is almost the same as doing gradient descent on a new function which (a) has the same global minima as the original function and (b) is "more convex". Again, the programming effort in doing this is minimal, in cost, complexity and effort. We implement our algorithm into both PyTorch and Tensorflow platforms, which will be made publicly available.

artificial intelligence, deep learning, machine learning, (13 more...)

arXiv.org Machine Learning

1806.06317

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.15)
Europe > France (0.04)

Genre: Research Report (0.50)

Industry: Government > Regional Government > North America Government > United States Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.89)

Add feedback

Stochastic Gradient Descent with Exponential Convergence Rates of Expected Classification Errors

Nitanda, Atsushi, Suzuki, Taiji

arXiv.org Machine LearningJun-14-2018

We consider stochastic gradient descent for binary classification problems in a reproducing kernel Hilbert space. In traditional analysis, it is known that the expected classification error converges more slowly than the expected risk even when assuming a low-noise condition on the conditional label probabilities. Consequently, the resulting rate is sublinear. Therefore, it is important to consider whether much faster convergence of the expected classification error can be achieved. In recent research, an exponential convergence rate for stochastic gradient descent was shown under a strong low-noise condition, but theoretical analysis of this was limited to the square loss function, which is somewhat inadequate for binary classification tasks. In this paper, we show an exponential convergence rate of the expected classification error in the final phase of learning for a wide class of differentiable convex loss functions under similar assumptions.

artificial intelligence, machine learning, stochastic gradient descent, (16 more...)

arXiv.org Machine Learning

1806.05438

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.48)

Add feedback

PAC-Bayes Control: Synthesizing Controllers that Provably Generalize to Novel Environments

Majumdar, Anirudha, Goldstein, Maxwell

arXiv.org Artificial IntelligenceJun-14-2018

Our goal is to synthesize controllers for robots that provably generalize well to novel environments given a dataset of example environments. The key technical idea behind our approach is to leverage tools from generalization theory in machine learning by exploiting a precise analogy (which we present in the form of a reduction) between robustness of controllers to novel environments and generalization of hypotheses in supervised learning. In particular, we utilize the Probably Approximately Correct (PAC)-Bayes framework, which allows us to obtain upper bounds (that hold with high probability) on the expected cost of (stochastic) controllers across novel environments. We propose control synthesis algorithms that explicitly seek to minimize this upper bound. The corresponding optimization problem can be solved using convex optimization (Relative Entropy Programming in particular) in the setting where we are optimizing over a finite control policy space. In the more general setting of continuously parameterized controllers, we minimize this upper bound using stochastic gradient descent. We present examples of our approach in the context of obstacle avoidance control with depth measurements. Our simulated examples demonstrate the potential of our approach to provide strong generalization guarantees on controllers for robotic systems with continuous state and action spaces, complicated (e.g., nonlinear) dynamics, and rich sensory inputs (e.g., depth measurements).

artificial intelligence, controller, machine learning, (16 more...)

arXiv.org Artificial Intelligence

1806.04225

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > New Jersey > Mercer County > Princeton (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
(2 more...)

Genre: Research Report (0.64)

Industry: Aerospace & Defense (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.68)
(4 more...)

Add feedback

Defending Against Saddle Point Attack in Byzantine-Robust Distributed Learning

Yin, Dong, Chen, Yudong, Ramchandran, Kannan, Bartlett, Peter

arXiv.org Machine LearningJun-14-2018

In this paper, we study robust large-scale distributed learning in the presence of saddle points in non-convex loss functions. We consider the Byzantine setting where some worker machines may have abnormal or even arbitrary and adversarial behavior. We argue that in the Byzantine setting, optimizing a non-convex function and escaping saddle points become much more challenging, even when robust gradient estimators are used. We develop ByzantinePGD, a robust and communication-efficient algorithm that can provably escape saddle points and converge to approximate local minimizers. The iteration complexity of our algorithm in the Byzantine setting matches that of standard gradient descent in the usual setting. We further provide three robust aggregation subroutines that can be used in ByzantinePGD, including median, trimmed mean, and iterative filtering. We characterize their performance in statistical settings, and argue for their near-optimality in different regimes including the high dimensional setting.

algorithm, artificial intelligence, machine learning, (15 more...)

arXiv.org Machine Learning

1806.05358

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.50)

Add feedback

INFERNO: Inference-Aware Neural Optimisation

de Castro, Pablo, Dorigo, Tommaso

arXiv.org Machine LearningJun-12-2018

Complex computer simulations are commonly required for accurate data modelling in many scientific disciplines, making statistical inference challenging due to the intractability of the likelihood evaluation for the observed data. Furthermore, sometimes one is interested on inference drawn over a subset of the generative model parameters while taking into account model uncertainty or misspecification on the remaining nuisance parameters. In this work, we show how non-linear summary statistics can be constructed by minimising inference-motivated losses via stochastic gradient descent.

artificial intelligence, machine learning, nuisance parameter, (17 more...)

arXiv.org Machine Learning

1806.04743

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.47)

Add feedback

Convergence of SGD in Learning ReLU Models with Separable Data

Xu, Tengyu, Zhou, Yi, Ji, Kaiyi, Liang, Yingbin

arXiv.org Machine LearningJun-12-2018

We consider the binary classification problem in which the objective function is the exponential loss with a ReLU model, and study the convergence property of the stochastic gradient descent (SGD) algorithm on linearly separable data. We show that the gradient descent (GD) algorithm do not always learn desirable model parameters due to the nonlinear ReLU model. Then, we identify a certain condition of data samples, under which we show that SGD can learn a proper classifier with implicit bias. In specific, we establish the sub-linear convergence rate of the function value generated by SGD to global minimum. We further show that SGD actually converges in expectation to the maximum margin classifier with respect to the samples with +1 label under the ReLU model at the rate O(1/ln t). We also extend our study to the case of multi-ReLU neurons, and show that SGD converges to a certain non-linear maximum margin classifier for a class of non-linearly separable data.

artificial intelligence, machine learning, sgd, (15 more...)

arXiv.org Machine Learning

1806.04339

Country: North America > United States > Ohio (0.04)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.90)

Add feedback

Meta-Learning for Stochastic Gradient MCMC

Gong, Wenbo, Li, Yingzhen, Hernández-Lobato, José Miguel

arXiv.org Machine LearningJun-12-2018

Stochastic gradient Markov chain Monte Carlo (SG-MCMC) has become increasingly popular for simulating posterior samples in large-scale Bayesian modeling. However, existing SG-MCMC schemes are not tailored to any specific probabilistic model, even a simple modification of the underlying dynamical system requires significant physical intuition. This paper presents the first meta-learning algorithm that allows automated design for the underlying continuous dynamics of an SG-MCMC sampler. The learned sampler generalizes Hamiltonian dynamics with state-dependent drift and diffusion, enabling fast traversal and efficient exploration of neural network energy landscapes. Experiments validate the proposed approach on both Bayesian fully connected neural network and Bayesian recurrent neural network tasks, showing that the learned sampler out-performs generic, hand-designed SG-MCMC algorithms, and generalizes to different datasets and larger architectures.

artificial intelligence, machine learning, sampler, (15 more...)

arXiv.org Machine Learning

1806.04522

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Germany > North Rhine-Westphalia > Upper Bavaria > Munich (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Research Report (0.50)
Instructional Material (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.48)
(2 more...)

Add feedback

Auto-Meta: Automated Gradient Based Meta Learner Search

Kim, Jaehong, Choi, Youngduck, Cha, Moonsu, Lee, Jung Kwon, Lee, Sangyeul, Kim, Sungwan, Choi, Yongseok, Kim, Jiwon

arXiv.org Artificial IntelligenceJun-11-2018

Fully automating machine learning pipeline is one of the outstanding challenges of general artificial intelligence, as practical machine learning often requires costly human driven process, such as hyper-parameter tuning, algorithmic selection, and model selection. In this work, we consider the problem of executing automated, yet scalable search for finding optimal gradient based meta-learners in practice. As a solution, we apply progressive neural architecture search to proto-architectures by appealing to the model agnostic nature of general gradient based meta learners. In the presence of recent universality result of Finn \textit{et al.}\cite{finn:universality_maml:DBLP:/journals/corr/abs-1710-11622}, our search is a priori motivated in that neural network architecture search dynamics---automated or not---may be quite different from that of the classical setting with the same target tasks, due to the presence of the gradient update operator. A posteriori, our search algorithm, given appropriately designed search spaces, finds gradient based meta learners with non-intuitive proto-architectures that are narrowly deep, unlike the inception-like structures previously observed in the resulting architectures of traditional NAS algorithms. Along with these notable findings, the searched gradient based meta-learner achieves state-of-the-art results on the few shot classification problem on Mini-ImageNet with $76.29\%$ accuracy, which is an $13.18\%$ improvement over results reported in the original MAML paper. To our best knowledge, this work is the first successful AutoML implementation in the context of meta learning.

artificial intelligence, deep learning, machine learning, (16 more...)

arXiv.org Artificial Intelligence

1806.06927

Country:

Europe > Austria > Vienna (0.14)
Oceania > Australia > New South Wales > Sydney (0.05)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.36)

Add feedback