Goto

Collaborating Authors

 Gradient Descent


Combinatorial Losses through Generalized Gradients of Integer Linear Programs

arXiv.org Machine Learning

When samples have internal structure, we often see a mismatch between the objective optimized during training and the model's goal during inference. For example, in sequence-to-sequence modeling we are interested in high-quality translated sentences, but training typically uses maximum likelihood at the word level. Learning to recognize individual faces from group photos, each captioned with the correct but unordered list of people in it, is another example where a mismatch between training and inference objectives occurs. In both cases, the natural training-time loss would involve a combinatorial problem -- dynamic programming-based global sequence alignment and weighted bipartite graph matching, respectively -- but solutions to combinatorial problems are not differentiable with respect to their input parameters, so surrogate, differentiable losses are used instead. Here, we show how to perform gradient descent over combinatorial optimization algorithms that involve continuous parameters, for example edge weights, and can be efficiently expressed as integer, linear, or mixed-integer linear programs. We demonstrate usefulness of gradient descent over combinatorial optimization in sequence-to-sequence modeling using differentiable encoder-decoder architecture with softmax or Gumbel-softmax, and in weakly supervised learning involving a convolutional, residual feed-forward network for image classification.


Nonasymptotic estimates for Stochastic Gradient Langevin Dynamics under local conditions in nonconvex optimization

arXiv.org Machine Learning

Within the context of empirical risk minimization, see Raginsky, Rakhlin, and Telgarsky (2017), we are concerned with a non-asymptotic analysis of sampling algorithms used in optimization. In particular, we obtain non-asymptotic error bounds for a popular class of algorithms called Stochastic Gradient Langevin Dynamics (SGLD). These results are derived in appropriate Wasserstein distances in the absence of the log-concavity of the target distribution. More precisely, the local Lipschitzness of the stochastic gradient $H(\theta, x)$ is assumed, and furthermore, the dissipativity and convexity at infinity condition are relaxed by removing the uniform dependence in $x$.


#003C Gradient Descent in Python Master Data Science

#artificialintelligence

We will first import libraries as NumPy, matplotlib, pyplot and derivative function. Then with a NumPy function โ€“ linspace() we define our variable \(w \) domain between 1.0 and 5.0 and 100 points. Also we define alpha which will represent learning rate. Next, we will define our \(y \) ( in our case \(J(w) \)) and plot to see a convex function, we will use \((w-3) 2 \). So we can see that we plotted our convex function as an example.


A Double Residual Compression Algorithm for Efficient Distributed Learning

arXiv.org Machine Learning

Large-scale machine learning models are often trained by parallel stochastic gradient descent algorithms. However, the communication cost of gradient aggregation and model synchronization between the master and worker nodes becomes the major obstacle for efficient learning as the number of workers and the dimension of the model increase. In this paper, we propose DORE, a DOuble REsidual compression stochastic gradient descent algorithm, to reduce over $95\%$ of the overall communication such that the obstacle can be immensely mitigated. Our theoretical analyses demonstrate that the proposed strategy has superior convergence properties for both strongly convex and nonconvex objective functions. The experimental results validate that DORE achieves the best communication efficiency while maintaining similar model accuracy and convergence speed in comparison with start-of-the-art baselines.


On Solving Minimax Optimization Locally: A Follow-the-Ridge Approach

arXiv.org Machine Learning

Many tasks in modern machine learning can be formulated as finding equilibria in \emph{sequential} games. In particular, two-player zero-sum sequential games, also known as minimax optimization, have received growing interest. It is tempting to apply gradient descent to solve minimax optimization given its popularity and success in supervised learning. However, it has been noted that naive application of gradient descent fails to find some local minimax and can converge to non-local-minimax points. In this paper, we propose \emph{Follow-the-Ridge} (FR), a novel algorithm that provably converges to and only converges to local minimax. We show theoretically that the algorithm addresses the notorious rotational behaviour of gradient dynamics, and is compatible with preconditioning and \emph{positive} momentum. Empirically, FR solves toy minimax problems and improves the convergence of GAN training compared to the recent minimax optimization algorithms.


Orthogonal Gradient Descent for Continual Learning

arXiv.org Machine Learning

Orthogonal Gradient Descent for Continual LearningMehrdad Farajtabar Navid Azizan 1 Alex Mott Ang Li DeepMind CalTech DeepMind DeepMind Abstract Neural networks are achieving state of the art and sometimes superhuman performance on learning tasks across a variety of domains. Whenever these problems require learning in a continual or sequential manner, however, neural networks suffer from the problem of catastrophic forgetting; they forget how to solve previous tasks after being trained on a new task, despite having the essential capacity to solve both tasks if they were trained on both simultaneously. In this paper, we propose to address this issue from a parameter space perspective and study an approach to restrict the direction of the gradient updates to avoid forgetting previously-learned data. We present the Orthogonal Gradient Descent (OGD) method, which accomplishes this goal by projecting the gradients from new tasks onto a subspace in which the neural network output on previous task does not change and the projected gradient is still in a useful direction for learning the new task. Our approach utilizes the high capacity of a neural network more efficiently and does not require storing the previously learned data that might raise privacy concerns. Experiments on common benchmarks reveal the effectiveness of the proposed OGD method. 1 Introduction One critical component of intelligence is the ability to learn continuously, when new information is constantly available but previously presented information is unavailable to retrieve. Despite their ubiquity in the real world, these problems have posed a longstanding challenge to artificial intelligence (Thrun and Mitchell, 1995; Hassabis et al., 2017).Correspondence to farajtabar@google.com. 1 Work done during an internship at DeepMind. A typical neural network training procedure over a sequence of different tasks usually results in degraded performance on previously trained tasks if the model could not revisit the data of previous tasks. This phenomenon is called catastrophic forgetting (McCloskey and Cohen, 1989; Ratcliff, 1990; French, 1999). Ideally, an intelligent agent should be able to learn consecutive tasks without degrading its performance on those already learned. With the deep learning renaissance (Krizhevsky et al., 2012; Hinton et al., 2006; Si-monyan and Zisserman, 2014) this problem has been revived (Srivastava et al., 2013; Goodfellow et al., 2013) with many followup studies (Parisi et al., 2019). One probable reason for this phenomenon is that neural networks are usually trained by Stochastic Gradient Descent (SGD)--or its variants--where the op-timizers produce gradients that are oblivious to past knowledge.


SGD Learns One-Layer Networks in WGANs

arXiv.org Machine Learning

Generative adversarial networks (GANs) are a widely used framework for learning generative models. Wasserstein GANs (WGANs), one of the most successful variants of GANs, require solving a minmax optimization problem to global optimality, but are in practice successfully trained using stochastic gradient descent-ascent. In this paper, we show that, when the generator is a one-layer network, stochastic gradient descent-ascent converges to a global solution with polynomial time and sample complexity.


The Local Elasticity of Neural Networks

arXiv.org Machine Learning

This paper presents a phenomenon in neural networks that we refer to as \textit{local elasticity}. Roughly speaking, a classifier is said to be locally elastic if its prediction at a feature vector $\bx'$ is \textit{not} significantly perturbed, after the classifier is updated via stochastic gradient descent at a (labeled) feature vector $\bx$ that is \textit{dissimilar} to $\bx'$ in a certain sense. This phenomenon is shown to persist for neural networks with nonlinear activation functions through extensive simulations on real-life and synthetic datasets, whereas this is not observed in linear classifiers. In addition, we offer a geometric interpretation of local elasticity using the neural tangent kernel \citep{jacot2018neural}. Building on top of local elasticity, we obtain pairwise similarity measures between feature vectors, which can be used for clustering in conjunction with $K$-means. The effectiveness of the clustering algorithm on the MNIST and CIFAR-10 datasets in turn corroborates the hypothesis of local elasticity of neural networks on real-life data. Finally, we discuss some implications of local elasticity to shed light on several intriguing aspects of deep neural networks.


Extracting robust and accurate features via a robust information bottleneck

arXiv.org Machine Learning

We propose a novel strategy for extracting features in supervised learning that can be used to construct a classifier which is more robust to small perturbations in the input space. Our method builds upon the idea of the information bottleneck by introducing an additional penalty term that encourages the Fisher information of the extracted features to be small, when parametrized by the inputs. By tuning the regularization parameter, we can explicitly trade off the opposing desiderata of robustness and accuracy when constructing a classifier. We derive the optimal solution to the robust information bottleneck when the inputs and outputs are jointly Gaussian, proving that the optimally robust features are also jointly Gaussian in that setting. Furthermore, we propose a method for optimizing a variational bound on the robust information bottleneck objective in general settings using stochastic gradient descent, which may be implemented efficiently in neural networks. Our experimental results for synthetic and real data sets show that the proposed feature extraction method indeed produces classifiers with increased robustness to perturbations.


On Higher-order Moments in Adam

arXiv.org Machine Learning

In this paper, we investigate the popular deep learning optimization routine, Adam, from the perspective of statistical moments. While Adam is an adaptive lower-order moment based (of the stochastic gradient) method, we propose an extension namely, HAdam, which uses higher order moments of the stochastic gradient. Our analysis and experiments reveal that certain higher-order moments of the stochastic gradient are able to achieve better performance compared to the vanilla Adam algorithm. We also provide some analysis of HAdam related to odd and even moments to explain some intriguing and seemingly non-intuitive empirical results.