AITopics | Gradient Descent

Gradient Descent

A Comprehensive Guide to Stochastic Gradient Descent Algorithms

#artificialintelligence, Nov-18-2019, 02:53:16 GMT

Unfortunately, the reality is a little different, particularly in deep models, where the number of parameters is on the order of ten or one hundred million. When the system is relatively shallow, it is easier to find local minima where the training process can stop; in deeper models, local minima become less probable and saddle points become increasingly likely.
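The difference between a minimum and a saddle point can be illustrated with a toy example. The sketch below is not taken from the article: the function f(x, y) = x**2 - y**2, the learning rate, and the step count are all assumptions chosen for illustration. Plain gradient descent stalls at the saddle when started exactly on the x-axis, while a tiny offset in y is enough to escape.

```python
def grad(x, y):
    # Gradient of the toy function f(x, y) = x**2 - y**2,
    # which has a saddle point (not a minimum) at the origin.
    return 2.0 * x, -2.0 * y


def descend(x, y, lr=0.1, steps=100):
    # Plain gradient descent with a fixed learning rate.
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


# Started exactly on the x-axis, the iterate converges to the saddle point.
stuck = descend(1.0, 0.0)

# A tiny offset in y is amplified at every step, so the iterate escapes along y.
escaped = descend(1.0, 1e-6)
```

In this toy case the gradient norm goes to zero in the first run even though the origin is not a minimum, which is why a vanishing gradient alone does not tell the two cases apart.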

comprehensive guide, stochastic gradient descent algorithm

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.85)

WITCHcraft: Efficient PGD attacks with random step size

Chiang, Ping-Yeh, Geiping, Jonas, Goldblum, Micah, Goldstein, Tom, Ni, Renkun, Reich, Steven, Shafahi, Ali

arXiv.org Machine Learning, Nov-18-2019

State-of-the-art adversarial attacks on neural networks use expensive iterative methods and numerous random restarts from different initial points. Iterative FGSM-based methods without restarts trade off performance for computational efficiency because they do not adequately explore the image space and are highly sensitive to the choice of step size. We propose a variant of Projected Gradient Descent (PGD) that uses a random step size to improve performance without resorting to expensive random restarts. Our method, Wide Iterative Stochastic crafting (WITCHcraft), achieves results superior to the classical PGD attack on the CIFAR-10 and MNIST data sets but without additional computational cost. This simple modification of PGD makes crafting attacks more economical, which is important in situations like adversarial training where attacks need to be crafted in real time.
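A minimal sketch of a PGD-style attack with a randomly drawn step size may help make the idea concrete. This is not the authors' implementation: the abstract does not state the step-size distribution, so the uniform draw on [0, 2 * mean_step] is an assumption, as are the toy linear logistic model and the hypothetical helper names `logistic_loss`, `loss_grad`, and `pgd_random_step`.

```python
import math
import random


def logistic_loss(w, b, x, y):
    # Logistic loss of a linear classifier for a label y in {-1, +1}.
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return math.log(1.0 + math.exp(-margin))


def loss_grad(w, b, x, y):
    # Gradient of the logistic loss with respect to the input x.
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def pgd_random_step(w, b, x, y, eps=0.1, mean_step=0.02, iters=20, seed=0):
    rng = random.Random(seed)
    x_adv = list(x)
    for _ in range(iters):
        g = loss_grad(w, b, x_adv, y)
        # Assumption: step size drawn uniformly with mean `mean_step`.
        step = rng.uniform(0.0, 2.0 * mean_step)
        # Signed-gradient ascent step on the loss.
        x_adv = [xa + step * sign(gi) for xa, gi in zip(x_adv, g)]
        # Project back onto the L-infinity ball of radius eps around x.
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


w = [1.0, -2.0, 0.5]
x = [0.5, -0.2, 0.1]
x_adv = pgd_random_step(w, 0.0, x, 1)
```

The only change relative to a fixed-step PGD loop is the single line that draws `step`, so the cost per iteration is the same.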

arxiv preprint arxiv, pgd attack, step size, (13 more...)

arXiv.org Machine Learning

1911.07989

Country: North America > United States > Maryland > Prince George's County > College Park (0.04)

Genre: Research Report (0.64)

Industry: Information Technology > Security & Privacy (0.72)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

vqSGD: Vector Quantized Stochastic Gradient Descent

Gandikota, Venkata, Maity, Raj Kumar, Mazumdar, Arya

arXiv.org Machine Learning, Nov-18-2019

For any $c \in \mathbb{R}^d$ and $r > 0$, let $B_d(c, r)$ denote a $d$-dimensional $\ell_2$ ball of radius $r$ centered at $c$. Let $e_i \in \mathbb{R}^d$ denote the $i$-th standard basis vector, which has 1 in the $i$-th position and 0 everywhere else. Also, let $\mathbf{1}_d$ and $\mathbf{0}_d$ denote the all-ones vector and the all-zeros vector in $\mathbb{R}^d$, respectively. By $[n]$ we denote the set $\{1, 2, \ldots, n\}$. For a discrete set of points $C \subset \mathbb{R}^d$, let $\mathrm{conv}(C)$ denote the convex hull of the points in $C$, i.e., $\mathrm{conv}(C) := \{ \sum_{c \in C} a_c c : a_c \ge 0, \ \sum_{c \in C} a_c = 1 \}$. Let $w \in \mathbb{R}^d$ be the parameters of a function to be learned (such as the weights of a neural network). In each step of the SGD algorithm, the parameters are updated as $w \leftarrow w - \eta \hat{g}$, where $\eta$ is a possibly time-varying learning rate and $\hat{g}$ is a stochastic unbiased estimate of $g$, the true gradient of some loss function with respect to $w$.
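The SGD update described above can be sketched in a few lines. The example below covers only the plain update with a mini-batch gradient as the unbiased estimate; it does not implement the quantization scheme of the paper. The least-squares loss, the synthetic data, and the names `minibatch_gradient` and `sgd` are assumptions made for illustration.

```python
import random


def minibatch_gradient(w, data, batch_size, rng):
    # Unbiased estimate of the gradient of the mean of 0.5 * (w.x - y)**2
    # over `data`, computed on a uniformly sampled mini-batch.
    batch = rng.sample(data, batch_size)
    g = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            g[i] += err * xi / batch_size
    return g


def sgd(data, dim, eta=0.1, steps=500, batch_size=4, seed=0):
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(steps):
        g_hat = minibatch_gradient(w, data, batch_size, rng)
        # The update from the text: w <- w - eta * g_hat.
        w = [wi - eta * gi for wi, gi in zip(w, g_hat)]
    return w


# Synthetic noise-free regression data generated from a known weight vector.
data_rng = random.Random(1)
true_w = [2.0, -1.0]
data = []
for _ in range(64):
    x = [data_rng.uniform(-1.0, 1.0), data_rng.uniform(-1.0, 1.0)]
    data.append((x, sum(t * xi for t, xi in zip(true_w, x))))

w = sgd(data, dim=2)
```

Because the targets here contain no noise, the iterates approach `true_w`; with noisy targets a fixed learning rate would leave the iterates fluctuating around the solution.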

communication, gradient, quantization scheme, (15 more...)

arXiv.org Machine Learning

1911.07971

Country:

North America > United States > Massachusetts > Hampshire County > Amherst (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Industry: Information Technology > Security & Privacy (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.86)

A Graph Autoencoder Approach to Causal Structure Learning

Ng, Ignavier, Zhu, Shengyu, Chen, Zhitang, Fang, Zhuangyan

arXiv.org Machine Learning, Nov-17-2019

Causal structure learning has been a challenging task in the past decades, and several mainstream approaches, such as constraint- and score-based methods, have been studied with theoretical guarantees. Recently, a new approach has transformed the combinatorial structure learning problem into a continuous one and then solved it using gradient-based optimization methods. Following this recent state-of-the-art work, we propose a new gradient-based method to learn causal structures from observational data. The proposed method generalizes the recent gradient-based methods to a graph autoencoder framework that allows nonlinear structural equation models and is easily applicable to vector-valued variables. We demonstrate that on synthetic datasets, our proposed method significantly outperforms other gradient-based methods, especially on large causal graphs. We further investigate the scalability and efficiency of our method, and observe a near-linear training time when scaling up the graph size.

dataset, gradient-based method, training time, (14 more...)

arXiv.org Machine Learning

1911.0742

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.77)

Stochastic Gradient Annealed Importance Sampling for Efficient Online Marginal Likelihood Estimation

Cameron, Scott A., Eggers, Hans C., Kroon, Steve

arXiv.org Machine Learning, Nov-17-2019

We consider estimating the marginal likelihood in settings with independent and identically distributed (i.i.d.) data. We propose estimating the predictive distributions in a sequential factorization of the marginal likelihood in such settings by using stochastic gradient Markov Chain Monte Carlo techniques. This approach is far more efficient than traditional marginal likelihood estimation techniques such as nested sampling and annealed importance sampling due to its use of mini-batches to approximate the likelihood. Stability of the estimates is provided by an adaptive annealing schedule. The resulting stochastic gradient annealed importance sampling (SGAIS) technique, which is the key contribution of our paper, enables us to estimate the marginal likelihood of a number of models considerably faster than traditional approaches, with no noticeable loss of accuracy. An important benefit of our approach is that the marginal likelihood is calculated in an online fashion as data becomes available, allowing the estimates to be used for applications such as online weighted model combination.

annealed importance, dependence, entropy 2019, (15 more...)

arXiv.org Machine Learning

doi: 10.3390/e21111109

1911.07337

Country:

Africa > South Africa (0.04)
North America > United States > Wisconsin > Dane County > Madison (0.04)
North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
(7 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.83)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

Robust Distributed Accelerated Stochastic Gradient Methods for Multi-Agent Networks

Fallah, Alireza, Gurbuzbalaban, Mert, Ozdaglar, Asuman, Simsekli, Umut, Zhu, Lingjiong

arXiv.org Machine Learning, Nov-15-2019

We study the distributed stochastic gradient (D-SG) method and its accelerated variant (D-ASG) for solving decentralized strongly convex stochastic optimization problems where the objective function is distributed over several computational units, lying on a fixed but arbitrary connected communication graph, subject to local communication constraints where noisy estimates of the gradients are available. We develop a framework that allows choosing the stepsize and the momentum parameters of these algorithms so as to optimize performance by systematically trading off the bias, variance, robustness to gradient noise, and dependence on network effects. When gradients do not contain noise, we also prove that distributed accelerated methods can achieve acceleration, requiring $\mathcal{O}(\sqrt{\kappa} \log(1/\varepsilon))$ gradient evaluations and $\mathcal{O}(\sqrt{\kappa} \log(1/\varepsilon))$ communications to converge to the same fixed point as the non-accelerated variant, where $\kappa$ is the condition number and $\varepsilon$ is the target accuracy. To our knowledge, this is the first acceleration result where the iteration complexity scales with the square root of the condition number in the context of primal distributed inexact first-order methods. For quadratic functions, we also provide finer performance bounds that are tight with respect to bias and variance terms. Finally, we study a multistage version of D-ASG with parameters carefully varied over stages to ensure exact $\mathcal{O}(\exp(-k/\sqrt{\kappa}))$ linear decay in the bias term as well as optimal $\mathcal{O}(\sigma^2/k)$ in the variance term. We illustrate through numerical experiments that our approach results in practical algorithms that are robust to gradient noise and that can outperform existing methods.

algorithm, dasg, nullx, (13 more...)

arXiv.org Machine Learning

1910.08701

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > New Jersey > Middlesex County > Piscataway (0.04)
(4 more...)

Genre: Research Report > New Finding (0.67)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.73)

Alternatives to the Gradient Descent Algorithm

#artificialintelligence, Nov-14-2019, 23:21:55 GMT

Gradient descent can get stuck in local minima, and several alternatives are available. The following is a summary of answers suggested on Cross Validated. There are many optimization algorithms that operate on a fixed number of real values that are correlated (non-separable). We can divide them roughly into two categories: gradient-based optimizers and derivative-free optimizers.
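The two categories can be contrasted on a small non-separable problem. The sketch below is illustrative only: the quadratic objective, the fixed-step gradient descent, and the accept-if-better random search are assumptions chosen as simple representatives of each category, not methods named in the article.

```python
import random


def f(p):
    # Non-separable convex quadratic: the x * y term couples the two variables.
    x, y = p
    return (x - 1.0) ** 2 + (y + 2.0) ** 2 + x * y


def grad_f(p):
    x, y = p
    return [2.0 * (x - 1.0) + y, 2.0 * (y + 2.0) + x]


def gradient_descent(p, lr=0.1, steps=200):
    # Gradient-based optimizer: needs grad_f.
    for _ in range(steps):
        g = grad_f(p)
        p = [pi - lr * gi for pi, gi in zip(p, g)]
    return p


def random_search(p, sigma=0.5, steps=5000, seed=0):
    # Derivative-free optimizer: only evaluates f, never its gradient.
    rng = random.Random(seed)
    best, best_val = list(p), f(p)
    for _ in range(steps):
        cand = [bi + rng.gauss(0.0, sigma) for bi in best]
        val = f(cand)
        if val < best_val:
            best, best_val = cand, val
    return best


start = [0.0, 0.0]
gd_solution = gradient_descent(start)
rs_solution = random_search(start)
```

For this objective the exact minimizer is (8/3, -10/3). Gradient descent reaches it to high precision in a few hundred steps, while the random search gets close using function values alone and many more evaluations.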

algorithm, convolutional, optimization algorithm, (11 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.75)

Optimal Mini-Batch Size Selection for Fast Gradient Descent

Perrone, Michael P., Khan, Haidar, Kim, Changhoan, Kyrillidis, Anastasios, Quinn, Jerry, Salapura, Valentina

arXiv.org Machine Learning, Nov-14-2019

Jerry Quinn, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598. Valentina Salapura, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598.

Abstract: This paper presents a methodology for selecting the mini-batch size that minimizes Stochastic Gradient Descent (SGD) learning time for single and multiple learner problems. By decoupling algorithmic analysis issues from hardware and software implementation details, we reveal a robust empirical inverse law between mini-batch size and the average number of SGD updates required to converge to a specified error threshold. Combining this empirical inverse law with measured system performance, we create an accurate, closed-form model of average training time and show how this model can be used to identify quantifiable implications for both algorithmic and hardware aspects of machine learning. We demonstrate the inverse law empirically, on both image recognition (MNIST, CIFAR10 and CIFAR100) and machine translation (Europarl) tasks, and provide a theoretical justification by proving a novel bound on mini-batch SGD training.

Introduction: In this paper, we present an empirical law, with theoretical justification, linking the number of learning iterations to the mini-batch size. From this result, we derive a principled methodology for selecting mini-batch size. This methodology saves training time and provides both intuition and a principled approach for optimizing machine learning algorithms and machine learning hardware system design. Further, we use our methodology to show that focusing on weak scaling can lead to suboptimal training times because, by neglecting the dependence of convergence time on the size of the mini-batch used, weak scaling does not always minimize the training time.
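To see how an inverse law combined with a hardware timing model yields a closed-form optimum, consider the following sketch. The excerpt does not give the paper's formulas, so every functional form here is an assumption: the update count is modeled as n_inf + alpha / batch, the time per update as t_fixed + t_sample * batch, and all parameter names and values are hypothetical.

```python
import math


def updates_to_converge(batch, n_inf, alpha):
    # Assumed inverse law: fewer updates are needed as the batch grows,
    # down to a floor of n_inf updates.
    return n_inf + alpha / batch


def time_per_update(batch, t_fixed, t_sample):
    # Assumed hardware model: fixed overhead plus a per-sample cost.
    return t_fixed + t_sample * batch


def training_time(batch, n_inf, alpha, t_fixed, t_sample):
    return (updates_to_converge(batch, n_inf, alpha)
            * time_per_update(batch, t_fixed, t_sample))


def best_batch_size(n_inf, alpha, t_fixed, t_sample):
    # Setting the derivative of training_time with respect to batch to zero
    # gives batch**2 = alpha * t_fixed / (n_inf * t_sample).
    return math.sqrt(alpha * t_fixed / (n_inf * t_sample))


# Hypothetical parameter values, chosen only to exercise the formulas.
params = dict(n_inf=1000.0, alpha=64000.0, t_fixed=1e-3, t_sample=1e-5)
b_star = best_batch_size(**params)
t_star = training_time(b_star, **params)
```

Under these assumptions the total time has the form A + B * batch + C / batch, so it rises both for very small batches (too many updates) and for very large ones (each update too slow), with a single minimum in between.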

algorithm, learning, mini-batch size, (14 more...)

arXiv.org Machine Learning

1911.06459

Country:

Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
North America > United States > Texas > Harris County > Houston (0.04)

Genre: Research Report (1.00)

Industry: Information Technology (0.54)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Federated and Differentially Private Learning for Electronic Health Records

Pfohl, Stephen R., Dai, Andrew M., Heller, Katherine

arXiv.org Machine Learning, Nov-13-2019

The use of collaborative and decentralized machine learning techniques such as federated learning has the potential to enable the development and deployment of clinical risk prediction models in low-resource settings without requiring sensitive data to be shared or stored in a central repository. This process necessitates communication of model weights or updates between collaborating entities, but it is unclear to what extent patient privacy is compromised as a result. To gain insight into this question, we study the efficacy of centralized versus federated learning in both private and non-private settings. The clinical prediction tasks we consider are the prediction of prolonged length of stay and in-hospital mortality across thirty-one hospitals in the eICU Collaborative Research Database. We find that while it is straightforward to apply differentially private stochastic gradient descent to achieve strong privacy bounds when training in a centralized setting, it is considerably more difficult to do so in the federated setting.
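For readers unfamiliar with differentially private SGD, the core step in the commonly used formulation is to clip each per-example gradient and add Gaussian noise before averaging. The sketch below shows only that step, in the centralized setting; it is not the paper's code, it performs no privacy accounting, and the names `clip`, `dp_sgd_step`, `max_norm`, and `noise_multiplier` are illustrative assumptions.

```python
import math
import random


def clip(g, max_norm):
    # Rescale g so that its L2 norm is at most max_norm.
    norm = math.sqrt(sum(gi * gi for gi in g))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [gi * scale for gi in g]


def dp_sgd_step(w, per_example_grads, lr, max_norm, noise_multiplier, rng):
    n = len(per_example_grads)
    summed = [0.0] * len(w)
    # Clip each example's gradient, then sum the clipped gradients.
    for g in per_example_grads:
        for i, gi in enumerate(clip(g, max_norm)):
            summed[i] += gi
    # Add Gaussian noise scaled to the clipping norm, then average.
    noisy = [(s + rng.gauss(0.0, noise_multiplier * max_norm)) / n
             for s in summed]
    return [wi - lr * gi for wi, gi in zip(w, noisy)]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]

# With noise_multiplier = 0 the step reduces to SGD on clipped gradients.
w_no_noise = dp_sgd_step([0.0, 0.0], grads, lr=1.0, max_norm=1.0,
                         noise_multiplier=0.0, rng=rng)

# With noise enabled, the update is perturbed by Gaussian noise.
w_private = dp_sgd_step([0.0, 0.0], grads, lr=1.0, max_norm=1.0,
                        noise_multiplier=1.1, rng=rng)
```

Clipping bounds how much any single record can move the model, which is what lets the added noise translate into a privacy guarantee once an accountant tracks the cumulative loss over training steps.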

artificial intelligence, learning, machine learning, (16 more...)

arXiv.org Machine Learning

1911.05861

Country:

North America > United States > Florida > Broward County > Fort Lauderdale (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > Canada (0.04)

Genre: Research Report > Experimental Study (0.69)

Industry:

Information Technology > Security & Privacy (0.88)
Health & Medicine > Health Care Technology > Medical Record (0.86)
Health & Medicine > Health Care Providers & Services (0.62)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

Asymptotics of Reinforcement Learning with Neural Networks

Sirignano, Justin, Spiliopoulos, Konstantinos

arXiv.org Machine Learning, Nov-13-2019

We prove that a single-layer neural network trained with the Q-learning algorithm converges in distribution to a random ordinary differential equation as the size of the model and the number of training steps become large. Analysis of the limit differential equation shows that it has a unique stationary solution which is the solution of the Bellman equation, thus giving the optimal control for the problem. In addition, we study the convergence of the limit differential equation to the stationary solution. As a by-product of our analysis, we obtain the limiting behavior of single-layer neural networks when trained on i.i.d. data with stochastic gradient descent under the widely-used Xavier initialization.

equation, lemma 5, neural network, (13 more...)

arXiv.org Machine Learning

1911.07304

Country:

North America > United States > New York (0.04)
North America > United States > Illinois (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
