Goto

Collaborating Authors

 Gradient Descent


Stochastic optimization with momentum: convergence, fluctuations, and traps avoidance

arXiv.org Machine Learning

In this paper, a general stochastic optimization procedure is studied, unifying several variants of the stochastic gradient descent such as, among others, the stochastic heavy ball method, the Stochastic Nesterov Accelerated Gradient algorithm (S-NAG), and the widely used Adam algorithm. The algorithm is seen as a noisy Euler discretization of a non-autonomous ordinary differential equation, recently introduced by Belotto da Silva and Gazeau, which is analyzed in depth. Assuming that the objective function is non-convex and differentiable, the stability and the almost sure convergence of the iterates to the set of critical points are established. A noteworthy special case is the convergence proof of S-NAG in a non-convex setting. Under some assumptions, the convergence rate is provided under the form of a Central Limit Theorem. Finally, the non-convergence of the algorithm to undesired critical points, such as local maxima or saddle points, is established. Here, the main ingredient is a new avoidance of traps result for non-autonomous settings, which is of independent interest.


DiffPrune: Neural Network Pruning with Deterministic Approximate Binary Gates and $L_0$ Regularization

arXiv.org Machine Learning

Modern neural network architectures typically have many millions of parameters and can be pruned significantly without substantial loss in effectiveness which demonstrates they are over-parameterized. The contribution of this work is two-fold. The first is a method for approximating a multivariate Bernoulli random variable by means of a deterministic and differentiable transformation of any real-valued multivariate random variable. The second is a method for model selection by element-wise multiplication of parameters with approximate binary gates that may be computed deterministically or stochastically and take on exact zero values. Sparsity is encouraged by the inclusion of a surrogate regularization to the $L_0$ loss. Since the method is differentiable it enables straightforward and efficient learning of model architectures by an empirical risk minimization procedure with stochastic gradient descent and theoretically enables conditional computation during training. The method also supports any arbitrary group sparsity over parameters or activations and therefore offers a framework for unstructured or flexible structured model pruning. To conclude experiments are performed to demonstrate the effectiveness of the proposed approach.


How to Manually Optimize Neural Network Models

#artificialintelligence

Deep learning neural network models are fit on training data using the stochastic gradient descent optimization algorithm. Updates to the weights of the model are made, using the backpropagation of error algorithm. The combination of the optimization and weight update algorithm was carefully chosen and is the most efficient approach known to fit neural networks. Nevertheless, it is possible to use alternate optimization algorithms to fit a neural network model to a training dataset. This can be a useful exercise to learn more about how neural networks function and the central nature of optimization in applied machine learning. It may also be required for neural networks with unconventional model architectures and non-differentiable transfer functions.


When does gradient descent with logistic loss find interpolating two-layer networks?

arXiv.org Machine Learning

The success of deep learning models has led to a lot of recent interest in understanding the properties of "interpolating" neural network models, that achieve (near-)zero training loss [Zha 17a; Bel 19]. One aspect of understanding these models is to theoretically characterize how first-order gradient methods (with appropriate random initialization) seem to reliably find interpolating solutions to non-convex optimization problems. In this paper, we show that, under two sets of conditions, training fixed-width two-layer networks with gradient descent drives the logistic loss to zero. The networks have smooth "Huberized" ReLUs [Tat 20, see (1) and Figure 1] and the output weights are not trained. The first result only requires the assumption that the initial loss is small, but does not require any assumption about either the width of the network or the number of samples. It guarantees that if the initial loss is small then gradient descent drives the logistic loss to zero. For our second result we assume that the inputs come from four clusters, two per class, and that the clusters corresponding to the opposite labels are appropriately separated. Under these assumptions, we show that random Gaussian initialization along with a single step of gradient descent is enough to guarantee that the loss reduces sufficiently that the first result applies. A few proof ideas that facilitate our results are as follows: under our first set of assumptions, when the loss is small, we show that the negative gradient aligns well with the parameter vector. 1


Training your Neural Network with Cyclical Learning Rates โ€“ MachineCurve

#artificialintelligence

At a high level, training supervised machine learning models involves a few easy steps: feeding data to your model, computing loss based on the differences between predictions and ground truth, and using loss to improve the model with an optimizer. For example, it's possible to choose multiple optimizers โ€“ ranging from traditional Stochastic Gradient Descent to adaptive optimizers, which are also very common today. Say that you settle for the first โ€“ Stochastic Gradient Descent (SGD). Likely, in your deep learning framework, you'll see that the learning rate is a parameter that can be configured, with a default value that is preconfigured most of the times. Now, what is this learning rate? Why do we need them?


SSGD: A safe and efficient method of gradient descent

arXiv.org Artificial Intelligence

With the vigorous development of artificial intelligence technology, various engineering technology applications have been implemented one after another. The gradient descent method plays an important role in solving various optimization problems, due to its simple structure, good stability and easy implementation. In multi-node machine learning system, the gradients usually need to be shared. Data reconstruction attacks can reconstruct training data simply by knowing the gradient information. In this paper, to prevent gradient leakage while keeping the accuracy of model, we propose the super stochastic gradient descent approach to update parameters by concealing the modulus length of gradient vectors and converting it or them into a unit vector. Furthermore, we analyze the security of stochastic gradient descent approach. Experiment results show that our approach is obviously superior to prevalent gradient descent approaches in terms of accuracy and robustness.


Second-Order Guarantees in Federated Learning

arXiv.org Machine Learning

Federated learning is a useful framework for centralized learning from distributed data under practical considerations of heterogeneity, asynchrony, and privacy. Federated architectures are frequently deployed in deep learning settings, which generally give rise to non-convex optimization problems. Nevertheless, most existing analysis are either limited to convex loss functions, or only establish first-order stationarity, despite the fact that saddle-points, which are first-order stationary, are known to pose bottlenecks in deep learning. We draw on recent results on the second-order optimality of stochastic gradient algorithms in centralized and decentralized settings, and establish second-order guarantees for a class of federated learning algorithms.


rTop-k: A Statistical Estimation Approach to Distributed SGD

arXiv.org Machine Learning

The large communication cost for exchanging gradients between different nodes significantly limits the scalability of distributed training for large-scale learning models. Motivated by this observation, there has been significant recent interest in techniques that reduce the communication cost of distributed Stochastic Gradient Descent (SGD), with gradient sparsification techniques such as top-k and random-k shown to be particularly effective. The same observation has also motivated a separate line of work in distributed statistical estimation theory focusing on the impact of communication constraints on the estimation efficiency of different statistical models. The primary goal of this paper is to connect these two research lines and demonstrate how statistical estimation models and their analysis can lead to new insights in the design of communication-efficient training techniques. We propose a simple statistical estimation model for the stochastic gradients which captures the sparsity and skewness of their distribution. The statistically optimal communication scheme arising from the analysis of this model leads to a new sparsification technique for SGD, which concatenates random-k and top-k, considered separately in the prior literature. We show through extensive experiments on both image and language domains with CIFAR-10, ImageNet, and Penn Treebank datasets that the concatenated application of these two sparsification methods consistently and significantly outperforms either method applied alone.


Gradient Sparsification Can Improve Performance of Differentially-Private Convex Machine Learning

arXiv.org Machine Learning

We use gradient sparsification to reduce the adverse effect of differential privacy noise on performance of private machine learning models. To this aim, we employ compressed sensing and additive Laplace noise to evaluate differentially-private gradients. Noisy privacy-preserving gradients are used to perform stochastic gradient descent for training machine learning models. Sparsification, achieved by setting the smallest gradient entries to zero, can reduce the convergence speed of the training algorithm. However, by sparsification and compressed sensing, the dimension of communicated gradient and the magnitude of additive noise can be reduced. The interplay between these effects determines whether gradient sparsification improves the performance of differentially-private machine learning models. We investigate this analytically in the paper. We prove that, for small privacy budgets, compression can improve performance of privacy-preserving machine learning models. However, for large privacy budgets, compression does not necessarily improve the performance. Intuitively, this is because the effect of privacy-preserving noise is minimal in large privacy budget regime and thus improvements from gradient sparsification cannot compensate for its slower convergence.


Gradient Descent, clearly explained in Python, Part 1: The troubling theory.

#artificialintelligence

If you have ever done a Kaggle competition, these would be commonly referred to as evaluation metrics. Typically, the lower the loss, the better the performance of your model. So if,for example, you were predicting house prices and using Mean Squared Error, and your cost was $25000, that means that your model is performing poorly as it is making a prediction error of $25000. Going back to our analogy, if you imagine that instead of a mountain there is a U-shaped curve, and instead of a person there is the cost function with maybe an initial cost value of 25,500. The aim of Gradient Descent would be to minimise this cost to either 0(global minimum), or something much smaller(local minimum).