 Gradient Descent


A Comprehensive Guide to Stochastic Gradient Descent Algorithms

#artificialintelligence

Unfortunately, the reality is a little different, particularly in deep models, where the number of parameters is on the order of ten or a hundred million. When the system is relatively shallow, it is easier to find local minima where the training process can stop; in deeper models, the probability of a local minimum becomes smaller and saddle points become more and more likely.
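To make the saddle-point remark concrete, here is a minimal sketch that is not taken from the article. It assumes the toy function f(x, y) = x^2 - y^2, which has a saddle at the origin, and the names `grad` and `gradient_descent` are illustrative only.

```python
def grad(x, y):
    # Gradient of the toy function f(x, y) = x**2 - y**2 (an assumed example).
    return 2.0 * x, -2.0 * y


def gradient_descent(x, y, lr=0.1, steps=100):
    # Plain gradient descent with a fixed learning rate.
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


# Started exactly on the x-axis, the iterate slides into the saddle at (0, 0).
stuck = gradient_descent(1.0, 0.0)
# A tiny offset in y grows by a factor of 1.2 per step, so the iterate escapes.
escaped = gradient_descent(1.0, 1e-6)
```

In this toy case the gradient vanishes at the saddle, so the first run stalls there, while the second run moves away along the direction of negative curvature.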


WITCHcraft: Efficient PGD attacks with random step size

arXiv.org Machine Learning

State-of-the-art adversarial attacks on neural networks use expensive iterative methods and numerous random restarts from different initial points. Iterative FGSM-based methods without restarts trade off performance for computational efficiency because they do not adequately explore the image space and are highly sensitive to the choice of step size. We propose a variant of Projected Gradient Descent (PGD) that uses a random step size to improve performance without resorting to expensive random restarts. Our method, Wide Iterative Stochastic crafting (WITCHcraft), achieves results superior to the classical PGD attack on the CIFAR-10 and MNIST data sets but without additional computational cost. This simple modification of PGD makes crafting attacks more economical, which is important in situations like adversarial training where attacks need to be crafted in real time.
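The idea of a PGD attack with a randomized step size can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it attacks a hypothetical linear classifier, and the uniform step-size distribution, the function name `pgd_random_step`, and all parameter values are assumptions.

```python
import random


def sign(v):
    # Returns -1, 0, or +1.
    return (v > 0) - (v < 0)


def pgd_random_step(x, y, w, eps=0.3, base_step=0.05, iters=20, seed=0):
    """L-infinity PGD against a linear score w . x with label y in {-1, +1}.

    The step size is redrawn every iteration from Uniform(0, 2 * base_step);
    that distribution is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    x_adv = list(x)
    for _ in range(iters):
        # For the loss -y * (w . x), the gradient with respect to x is -y * w.
        grad = [-y * wi for wi in w]
        step = rng.uniform(0.0, 2.0 * base_step)
        # Ascent step along the sign of the gradient (FGSM-style update).
        x_adv = [xa + step * sign(g) for xa, g in zip(x_adv, grad)]
        # Project back onto the eps-ball around the clean input.
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


x_clean, label, weights = [0.5, -0.2], 1, [1.0, -2.0]
x_adv = pgd_random_step(x_clean, label, weights)
```

The sketch omits clipping to a valid pixel range and uses no random restarts; it only shows where the random step size enters the loop.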


vqSGD: Vector Quantized Stochastic Gradient Descent

arXiv.org Machine Learning

For any $c \in \mathbb{R}^d$ and $r > 0$, let $B_d(c, r)$ denote a $d$-dimensional $\ell_2$ ball of radius $r$ centered at $c$. Let $e_i \in \mathbb{R}^d$ denote the $i$-th standard basis vector, which has 1 in the $i$-th position and 0 everywhere else. Also, let $1_d$ and $0_d$ denote the all-ones vector and the all-zeros vector in $\mathbb{R}^d$, respectively. By $[n]$ we denote the set $\{1, 2, \ldots, n\}$. For a discrete set of points $C \subset \mathbb{R}^d$, let $\mathrm{conv}(C)$ denote the convex hull of the points in $C$, i.e., $\mathrm{conv}(C) := \{ \sum_{c \in C} a_c c \mid a_c \ge 0, \ \sum_{c \in C} a_c = 1 \}$. Let $w \in \mathbb{R}^d$ be the parameters of a function to be learned (such as the weights of a neural network). In each step of the SGD algorithm, the parameters are updated as $w \leftarrow w - \eta \hat{g}$, where $\eta$ is a possibly time-varying learning rate and $\hat{g}$ is a stochastic unbiased estimate of $g$, the true gradient of some loss function with respect to $w$.
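The SGD update described in this excerpt (subtract the learning rate times an unbiased stochastic gradient estimate from the parameters) can be sketched as below. The least-squares loss, the tiny dataset, and the name `sgd_least_squares` are assumptions for illustration, and the sketch does not include the paper's quantization scheme.

```python
import random


def sgd_least_squares(data, eta=0.05, steps=2000, seed=0):
    """Minimise the mean of 0.5 * (w . x - y)**2 over the dataset with SGD."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    for _ in range(steps):
        # Sampling one example uniformly makes g_hat an unbiased
        # estimate of the gradient of the mean loss.
        x, y = rng.choice(data)
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        g_hat = [err * xi for xi in x]
        # SGD update: w <- w - eta * g_hat
        w = [wi - eta * gi for wi, gi in zip(w, g_hat)]
    return w


# Noise-free data generated from y = 2 * x1 - 1 * x2 (hypothetical).
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
w_hat = sgd_least_squares(data)
```

Because the data here are noise-free and consistent, a constant learning rate is enough for the iterates to approach the generating weights; with noisy data a decaying learning rate would usually be needed.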


A Graph Autoencoder Approach to Causal Structure Learning

arXiv.org Machine Learning

Causal structure learning has been a challenging task in the past decades, and several mainstream approaches such as constraint- and score-based methods have been studied with theoretical guarantees. Recently, a new approach has transformed the combinatorial structure learning problem into a continuous one and then solved it using gradient-based optimization methods. Following this recent state-of-the-art work, we propose a new gradient-based method to learn causal structures from observational data. The proposed method generalizes the recent gradient-based methods to a graph autoencoder framework that allows nonlinear structural equation models and is easily applicable to vector-valued variables. We demonstrate that on synthetic datasets, our proposed method significantly outperforms other gradient-based methods, especially on large causal graphs. We further investigate the scalability and efficiency of our method, and observe a near-linear training time when scaling up the graph size.


Stochastic Gradient Annealed Importance Sampling for Efficient Online Marginal Likelihood Estimation

arXiv.org Machine Learning

We consider estimating the marginal likelihood in settings with independent and identically distributed (i.i.d.) data. We propose estimating the predictive distributions in a sequential factorization of the marginal likelihood in such settings by using stochastic gradient Markov Chain Monte Carlo techniques. This approach is far more efficient than traditional marginal likelihood estimation techniques such as nested sampling and annealed importance sampling due to its use of mini-batches to approximate the likelihood. Stability of the estimates is provided by an adaptive annealing schedule. The resulting stochastic gradient annealed importance sampling (SGAIS) technique, which is the key contribution of our paper, enables us to estimate the marginal likelihood of a number of models considerably faster than traditional approaches, with no noticeable loss of accuracy. An important benefit of our approach is that the marginal likelihood is calculated in an online fashion as data becomes available, allowing the estimates to be used for applications such as online weighted model combination.


Robust Distributed Accelerated Stochastic Gradient Methods for Multi-Agent Networks

arXiv.org Machine Learning

We study the distributed stochastic gradient (D-SG) method and its accelerated variant (D-ASG) for solving decentralized strongly convex stochastic optimization problems where the objective function is distributed over several computational units, lying on a fixed but arbitrary connected communication graph, subject to local communication constraints where noisy estimates of the gradients are available. We develop a framework that allows the stepsize and momentum parameters of these algorithms to be chosen so as to optimize performance by systematically trading off the bias, variance, robustness to gradient noise, and dependence on network effects. When gradients do not contain noise, we also prove that distributed accelerated methods can achieve acceleration, requiring $\mathcal{O}(\sqrt{\kappa} \log(1/\varepsilon))$ gradient evaluations and $\mathcal{O}(\sqrt{\kappa} \log(1/\varepsilon))$ communications to converge to the same fixed point as the non-accelerated variant, where $\kappa$ is the condition number and $\varepsilon$ is the target accuracy. To our knowledge, this is the first acceleration result where the iteration complexity scales with the square root of the condition number in the context of primal distributed inexact first-order methods. For quadratic functions, we also provide finer performance bounds that are tight with respect to bias and variance terms. Finally, we study a multistage version of D-ASG with parameters carefully varied over stages to ensure exact $\mathcal{O}(\exp(-k/\sqrt{\kappa}))$ linear decay in the bias term as well as optimal $\mathcal{O}(\sigma^2/k)$ in the variance term. We illustrate through numerical experiments that our approach results in practical algorithms that are robust to gradient noise and that can outperform existing methods.
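As a rough illustration of a distributed stochastic gradient iteration over a communication graph, the sketch below uses the common "average with neighbours, then take a local gradient step" form. This form, the scalar quadratic objectives, the mixing matrix `W`, and the name `distributed_sg` are assumptions; the paper's exact algorithm and parameter choices may differ.

```python
import random


def distributed_sg(targets, mixing, step=0.05, iters=500, sigma=0.0, seed=0):
    """Agent i holds f_i(x) = 0.5 * (x - targets[i])**2 and a scalar iterate."""
    rng = random.Random(seed)
    n = len(targets)
    x = [0.0] * n
    for _ in range(iters):
        # Local gradients with additive Gaussian noise; sigma = 0 is noiseless.
        grads = [x[i] - targets[i] + rng.gauss(0.0, sigma) for i in range(n)]
        # Mix neighbours' iterates, then take a local gradient step.
        x = [
            sum(mixing[i][j] * x[j] for j in range(n)) - step * grads[i]
            for i in range(n)
        ]
    return x


# Hypothetical doubly stochastic mixing matrix for three fully connected agents.
W = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
iterates = distributed_sg([1.0, 2.0, 6.0], W)
```

With a constant step size and no noise, the average of the agents' iterates reaches the minimiser of the summed objective (3.0 here), while the individual iterates settle at slightly different values around it. That gap is the kind of bias the stepsize choice trades off.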


Alternatives to the Gradient Descent Algorithm

#artificialintelligence

Gradient descent can get stuck in local minima, but alternatives are available. The following is a summary of answers suggested on Cross Validated. There are many optimization algorithms that operate on a fixed number of real values that are correlated (non-separable). They can be divided roughly into two categories: gradient-based optimizers and derivative-free optimizers.
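The contrast between the two categories can be sketched on a one-dimensional example. The objective `f`, the starting point, and the use of pure random search as the derivative-free method are all assumptions chosen for illustration; they are not taken from the summarized answers.

```python
import random


def f(x):
    # Hypothetical objective with a local minimum near x = 1.13
    # and a lower, global minimum near x = -1.30.
    return x**4 - 3.0 * x**2 + x


def df(x):
    # Derivative of f, used only by the gradient-based optimizer.
    return 4.0 * x**3 - 6.0 * x + 1.0


def gradient_descent(x, lr=0.01, steps=1000):
    for _ in range(steps):
        x -= lr * df(x)
    return x


def random_search(low, high, samples=1000, seed=0):
    # Derivative-free: evaluate f at random points and keep the best one.
    rng = random.Random(seed)
    return min((rng.uniform(low, high) for _ in range(samples)), key=f)


x_gd = gradient_descent(2.0)      # settles in the local minimum
x_rs = random_search(-3.0, 3.0)   # lands near the global minimum
```

Random search scales poorly with dimension, so this is only a demonstration that a derivative-free method need not share gradient descent's sensitivity to the starting point.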


Optimal Mini-Batch Size Selection for Fast Gradient Descent

arXiv.org Machine Learning

Jerry Quinn, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598
Valentina Salapura, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598

Abstract: This paper presents a methodology for selecting the mini-batch size that minimizes Stochastic Gradient Descent (SGD) learning time for single and multiple learner problems. By decoupling algorithmic analysis issues from hardware and software implementation details, we reveal a robust empirical inverse law between mini-batch size and the average number of SGD updates required to converge to a specified error threshold. Combining this empirical inverse law with measured system performance, we create an accurate, closed-form model of average training time and show how this model can be used to identify quantifiable implications for both algorithmic and hardware aspects of machine learning. We demonstrate the inverse law empirically, on both image recognition (MNIST, CIFAR10 and CIFAR100) and machine translation (Europarl) tasks, and provide a theoretical justification by proving a novel bound on mini-batch SGD training.

Introduction: In this paper, we present an empirical law, with theoretical justification, linking the number of learning iterations to the mini-batch size. From this result, we derive a principled methodology for selecting mini-batch size w.r.t. This methodology saves training time and provides both intuition and a principled approach for optimizing machine learning algorithms and machine learning hardware system design. Further, we use our methodology to show that focusing on weak scaling can lead to suboptimal training times because, by neglecting the dependence of convergence time on the size of the mini-batch used, weak scaling does not always minimize the training time.
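One way to see how an inverse law can yield a closed-form optimal mini-batch size is the sketch below. It assumes that the number of updates follows N(B) = n_inf + alpha / B and that the time per update grows linearly with the batch size; both functional forms, every constant, and all function names are hypothetical and are not taken from the paper.

```python
import math


def updates_to_converge(batch, n_inf, alpha):
    # Assumed inverse law: N(B) = n_inf + alpha / B.
    return n_inf + alpha / batch


def time_per_update(batch, t_fixed, t_per_sample):
    # Assumed linear cost model: T(B) = t_fixed + t_per_sample * B.
    return t_fixed + t_per_sample * batch


def training_time(batch, n_inf, alpha, t_fixed, t_per_sample):
    # Total time is the number of updates times the time per update.
    return updates_to_converge(batch, n_inf, alpha) * time_per_update(
        batch, t_fixed, t_per_sample
    )


def optimal_batch(n_inf, alpha, t_fixed, t_per_sample):
    # Setting d/dB [N(B) * T(B)] = 0 gives
    # B* = sqrt(alpha * t_fixed / (n_inf * t_per_sample)).
    return math.sqrt(alpha * t_fixed / (n_inf * t_per_sample))


# Made-up constants, for illustration only.
params = dict(n_inf=500.0, alpha=1e5, t_fixed=0.01, t_per_sample=1e-4)
b_star = optimal_batch(**params)
# Brute-force check over integer batch sizes.
b_grid = min(range(1, 1001), key=lambda b: training_time(b, **params))
```

Under these assumptions the total time has one term that grows with B and one that shrinks with B, so an interior optimum exists. The brute-force search over integer batch sizes serves as a check on the closed form.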


Federated and Differentially Private Learning for Electronic Health Records

arXiv.org Machine Learning

The use of collaborative and decentralized machine learning techniques such as federated learning has the potential to enable the development and deployment of clinical risk prediction models in low-resource settings without requiring sensitive data to be shared or stored in a central repository. This process necessitates communication of model weights or updates between collaborating entities, but it is unclear to what extent patient privacy is compromised as a result. To gain insight into this question, we study the efficacy of centralized versus federated learning in both private and non-private settings. The clinical prediction tasks we consider are the prediction of prolonged length of stay and in-hospital mortality across thirty-one hospitals in the eICU Collaborative Research Database. We find that while it is straightforward to apply differentially private stochastic gradient descent to achieve strong privacy bounds when training in a centralized setting, it is considerably more difficult to do so in the federated setting.
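For readers unfamiliar with differentially private SGD, a single update step is commonly described as: clip each per-example gradient, sum, add Gaussian noise, then average. The sketch below follows that common description; it is not the authors' code, the names `clip` and `dp_sgd_step` and all parameter values are illustrative, and it does not compute the resulting privacy budget.

```python
import math
import random


def clip(g, max_norm):
    # Rescale g so that its L2 norm is at most max_norm.
    norm = math.sqrt(sum(v * v for v in g))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in g]


def dp_sgd_step(w, per_example_grads, lr, max_norm, noise_multiplier, rng):
    # 1. Clip every per-example gradient.
    clipped = [clip(g, max_norm) for g in per_example_grads]
    # 2. Sum the clipped gradients coordinate-wise.
    summed = [sum(col) for col in zip(*clipped)]
    # 3. Add Gaussian noise with std noise_multiplier * max_norm.
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    # 4. Average over the batch and take a gradient step.
    n = len(per_example_grads)
    return [wi - lr * ni / n for wi, ni in zip(w, noisy)]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # hypothetical per-example gradients
# noise_multiplier = 0 disables the noise so the clipping effect is visible.
w_new = dp_sgd_step([0.0, 0.0], grads, lr=1.0, max_norm=1.0,
                    noise_multiplier=0.0, rng=rng)
```

In this example the first gradient (norm 5) is scaled down to norm 1, while the second (norm 0.5) is left unchanged. With a positive `noise_multiplier`, the same step would also perturb the summed gradient.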


Asymptotics of Reinforcement Learning with Neural Networks

arXiv.org Machine Learning

We prove that a single-layer neural network trained with the Q-learning algorithm converges in distribution to a random ordinary differential equation as the size of the model and the number of training steps become large. Analysis of the limit differential equation shows that it has a unique stationary solution, which is the solution of the Bellman equation, thus giving the optimal control for the problem. In addition, we study the convergence of the limit differential equation to the stationary solution. As a by-product of our analysis, we obtain the limiting behavior of single-layer neural networks when trained on i.i.d. data with stochastic gradient descent under the widely used Xavier initialization.