Goto

Collaborating Authors

 Gradient Descent


Implicit competitive regularization in GANs

arXiv.org Machine Learning

Generative adversarial networks (GANs) are capable of producing high quality samples, but they suffer from numerous issues such as instability and mode collapse during training. To combat this, we propose to model the generator and discriminator as agents acting under local information, uncertainty, and awareness of their opponent. By doing so we achieve stable convergence, even when the underlying game has no Nash equilibria. We call this mechanism implicit competitive regularization (ICR) and show that it is present in the recently proposed competitive gradient descent (CGD). When comparing CGD to Adam using a variety of loss functions and regularizers on CIFAR10, CGD shows a much more consistent performance, which we attribute to ICR. In our experiments, we achieve the highest inception score when using the WGAN loss (without gradient penalty or weight clipping) together with CGD. This can be interpreted as minimizing a form of integral probability metric based on ICR. Generative adversarial networks (GANs): (Goodfellow et al., 2014) are a class of generative models based on a competitive game between a generator that tries to generate realistic new data, and a discriminator, that tries to distinguish generated from real data.


One Sample Stochastic Frank-Wolfe

arXiv.org Machine Learning

One of the beauties of the projected gradient descent method lies in its rather simple mechanism and yet stable behavior with inexact, stochastic gradients, which has led to its wide-spread use in many machine learning applications. However, once we replace the projection operator with a simpler linear program, as is done in the Frank-Wolfe method, both simplicity and stability take a serious hit. The aim of this paper is to bring them back without sacrificing the efficiency. In this paper, we propose the first one-sample stochastic Frank-Wolfe algorithm, called 1-SFW, that avoids the need to carefully tune the batch size, step size, learning rate, and other complicated hyper parameters. In particular, 1-SFW achieves the optimal convergence rate of $\mathcal{O}(1/\epsilon^2)$ for reaching an $\epsilon$-suboptimal solution in the stochastic convex setting, and a $(1-1/e)-\epsilon$ approximate solution for a stochastic monotone DR-submodular maximization problem. Moreover, in a general non-convex setting, 1-SFW finds an $\epsilon$-first-order stationary point after at most $\mathcal{O}(1/\epsilon^3)$ iterations, achieving the current best known convergence rate. All of this is possible by designing a novel unbiased momentum estimator that governs the stability of the optimization process while using a single sample at each iteration.


Accelerating Federated Learning via Momentum Gradient Descent

arXiv.org Machine Learning

Federated learning (FL) provides a communication-efficient approach to solve machine learning problems concerning distributed data, without sending raw data to a central server. However, existing works on FL only utilize first-order gradient descent (GD) and do not consider the preceding iterations to gradient update which can potentially accelerate convergence. In this paper, we consider momentum term which relates to the last iteration. The proposed momentum federated learning (MFL) uses momentum gradient descent (MGD) in the local update step of FL system. We establish global convergence properties of MFL and derive an upper bound on MFL convergence rate. Comparing the upper bounds on MFL and FL convergence rate, we provide conditions in which MFL accelerates the convergence. For different machine learning models, the convergence performance of MFL is evaluated based on experiments with MNIST dataset. Simulation results comfirm that MFL is globally convergent and further reveal significant convergence improvement over FL.


Is Rectified Adam actually *better* than Adam? - PyImageSearch

#artificialintelligence

Is the Rectified Adam (RAdam) optimizer actually better than the standard Adam optimizer? According to my 24 experiments, the answer is no, typically not (but there are cases where you do want to use it instead of Adam). In Liu et al.'s 2018 paper, On the Variance of the Adaptive Learning Rate and Beyond, the authors claim that Rectified Adam can obtain: The authors tested their hypothesis on three different datasets, including one NLP dataset and two computer vision datasets (ImageNet and CIFAR-10). In each case Rectified Adam outperformed standard Adam…but failed to outperform standard Stochastic Gradient Descent (SGD)! The Rectified Adam optimizer has some strong theoretical justifications -- but as a deep learning practitioner, you need more than just theory -- you need to see empirical results applied to a variety of datasets. And perhaps more importantly, you need to obtain a mastery level experience operating/driving the optimizer (or a small subset of optimizers) as well. If you haven't yet, go ahead and read part one to ensure you have a good understanding of how the Rectified Adam optimizer works. From there, read today's post to help you understand how to design, code, and run experiments used to compare deep learning optimizers. To learn how to compare Rectified Adam to standard Adam, just keep reading! In the first part of this tutorial, we'll briefly discuss the Rectified Adam optimizer, including how it works and why it's interesting to us as deep learning practitioners.


Differentiable Sparsification for Deep Neural Networks

arXiv.org Machine Learning

A deep neural network has relieved the burden of feature engineering by human experts, but comparable efforts are instead required to determine an effective architecture. On the other hands, as the size of a network has over-grown, a lot of resources are also invested to reduce its size. These problems can be addressed by sparsification of an over-complete model, which removes redundant parameters or connections by pruning them away after training or encouraging them to become zero during training. In general, however, these approaches are not fully differentiable and interrupt an end-to-end training process with the stochastic gradient descent in that they require either a parameter selection or a soft-thresholding step. In this paper, we propose a fully differentiable sparsification method for deep neural networks, which allows parameters to be exactly zero during training, and thus can learn the sparsified structure and the weights of networks simultaneously using the stochastic gradient descent. We apply the proposed method to various popular models in order to show its effectiveness.


FIS-GAN: GAN with Flow-based Importance Sampling

arXiv.org Machine Learning

Generative Adversarial Networks (GAN) training process, in most cases, apply uniform and Gaussian sampling methods in latent space, which probably spends most of the computation on examples that can be properly handled and easy to generate. Theoretically, importance sampling speeds up stochastic gradient algorithms for supervised learning by prioritizing training examples. In this paper, we explore the possibility for adapting importance sampling into adversarial learning. We use importance sampling to replace uniform and Gaussian sampling methods in latent space and combine normalizing flow with importance sampling to approximate latent space posterior distribution by density estimation. Empirically, results on MNIST and Fashion-MNIST demonstrate that our method significantly accelerates the convergence of generative process while retaining visual fidelity in generated samples.


On the convergence of gradient descent for two layer neural networks

arXiv.org Machine Learning

It has been shown that gradient descent can yield the zero training loss in the over-parametrized regime (the width of the neural networks is much larger than the number of data points). In this work, combining the ideas of some existing works, we investigate the gradient descent method for training two-layer neural networks for approximating some target continuous functions. By making use the generic chaining technique from probability theory, we show that gradient descent can yield an exponential convergence rate, while the width of the neural networks needed is independent of the size of the training data. The result also implies some strong approximation ability of the two-layer neural networks without curse of dimensionality.


ES-MAML: Simple Hessian-Free Meta Learning

arXiv.org Artificial Intelligence

Meta-learning is a paradigm in machine learning which aims to develop models and training algorithms which can quickly adapt to new tasks and data. Our focus in this paper is on meta-learning in reinforcement learning (RL), where data efficiency is of paramount importance because gathering new samples often requires costly simulations or interactions with the real world. A popular technique for RL meta-learning is Model Agnostic Meta Learning (MAML) (Finn et al., 2017, 2018), a model for training an agent (the meta-policy) which can quickly adapt to new and unknown tasks by performing one (or a few) gradient updates in the new environment. We provide a formal description of MAML in Section 2. MAML has proven to be successful for many applications. However, implementing and running MAML continues to be challenging. One major complication is that the standard version of MAML requires estimating second derivatives of the RL reward function, which is difficult when using backpropagation on stochastic policies; indeed, the original implementation of MAML (Finn et al., 2017) did so incorrectly, which spurred the development of unbiased higher-order estimators (DiCE, (Foerster et al., 2018)) and further analysis of the credit assignment mechanism in MAML (Rothfuss et al., 2019).


Distributed Learning of Deep Neural Networks using Independent Subnet Training

arXiv.org Machine Learning

Stochastic gradient descent (SGD) is the method of choice for distributed machine learning, by virtue of its light complexity per iteration on compute nodes, leading to almost linear speedups in theory. Nevertheless, such speedups are rarely observed in practice, due to high communication overheads during synchronization steps. We alleviate this problem by introducing independent subnet training: a simple, jointly model-parallel and data-parallel, approach to distributed training for fully connected, feed-forward neural networks. During subnet training, neurons are stochastically partitioned without replacement, and each partition is sent only to a single worker. This reduces the overall synchronization overhead, as each worker only receives the weights associated with the subnetwork it has been assigned to. Subnet training also reduces synchronization frequency: since workers train disjoint portions of the network, the training can proceed for long periods of time before synchronization, similar to local SGD approaches. We empirically evaluate our approach on real-world speech recognition and product recommendation applications, where we observe that subnet training i) results into accelerated training times, as compared to state of the art distributed models, and ii) often results into boosting the testing accuracy, as it implicitly combines dropout and batch normalization regularizations during training.


The Complexity of Finding Stationary Points with Stochastic Gradient Descent

arXiv.org Machine Learning

We study the iteration complexity of stochastic gradient descent (SGD) for minimizing the gradient norm of smooth, possibly nonconvex functions. We provide several results, implying that the classical $\mathcal{O}(\epsilon^{-4})$ upper bound (for making the average gradient norm less than $\epsilon$) cannot be improved upon, unless a combination of additional assumptions is made. Notably, this holds even if we limit ourselves to convex quadratic functions. We also show that for nonconvex functions, the feasibility of minimizing gradients with SGD is surprisingly sensitive to the choice of optimality criteria.