Goto

Collaborating Authors

 Gradient Descent


Three Mechanisms of Weight Decay Regularization

arXiv.org Machine Learning

We empirically investigate weight decay for three optimization algorithms (SGD, Adam, and K-FAC) and a variety of network architectures. We identify three distinct mechanisms by which weight decay exerts a regularization effect, depending on the particular optimization algorithm and architecture: (1) increasing the effective learning rate, (2) approximately regularizing the inputoutput Jacobian norm, and (3) reducing the effective damping coefficient for second-order optimization. Our results provide insight into how to improve the regularization of neural networks. Weight decay has long been a standard trick to improve the generalization performance of neural networks (Krogh & Hertz, 1992; Bos & Chug, 1996) by encouraging the weights to be small in magnitude. However, several findings cast doubt on this interpretation: - Weight decay has sometimes been observed to improve training accuracy, not just generalization performance (e.g. In principle, weight decay regularization should have no effect in this case, since one can scale the weights by a small factor without changing the network's predictions. Hence, it does not meaningfully constrain the network's capacity. The effect of weight decay remains poorly understood, and we lack clear guidelines for which tasks and architectures it is likely to help or hurt. A better understanding of the role of weight decay would help us design more efficient and robust neural network architectures.


Kalman Gradient Descent: Adaptive Variance Reduction in Stochastic Optimization

arXiv.org Machine Learning

Stochastic optimization is an essential component of most state-of-the-art the machine learning techniques. Sources of stochasticity in machine learning optimization include handling large datasets, approximating expectations, and modelling uncertain dynamic environments. The seminal work of (Robbins & Monro, 1985) showed that, under certain conditions, it is possible to use gradient-based optimization in the presence of randomness. However, it is well-known that gradient randomness has an adverse effect on the performance of stochastic gradient descent (SGD) (Wang, Chen, Smola, & Xing, 2013). As a result, the construction of methods to reduce gradient noise is an active field of research (Wang et al., 2013; Mandt & Blei, 2014; Grathwohl, Choi, Wu, Roeder, & Duvenaud, 2017; Roeder, Wu, & Duvenaud, 2017).


machine learning .p11 - Little maths behind Gradient Descent. [Hindi]

#artificialintelligence

Hello geeks, this is the 11th video of machine learning tutorial. In this video we'll talk about a very little maths behind gradient descent method, for that you don't need to be a great mathematician, a high school student can easily understand the concept.


Orthogonally Decoupled Variational Gaussian Processes

arXiv.org Machine Learning

Gaussian processes (GPs) provide a powerful non-parametric framework for reasoning over functions. Despite appealing theory, its superlinear computational and memory complexities have presented a long-standing challenge. State-of-the-art sparse variational inference methods trade modeling accuracy against complexity. However, the complexities of these methods still scale superlinearly in the number of basis functions, implying that that sparse GP methods are able to learn from large datasets only when a small model is used. Recently, a decoupled approach was proposed that removes the unnecessary coupling between the complexities of modeling the mean and the covariance functions of a GP. It achieves a linear complexity in the number of mean parameters, so an expressive posterior mean function can be modeled. While promising, this approach suffers from optimization difficulties due to ill-conditioning and non-convexity. In this work, we propose an alternative decoupled parametrization. It adopts an orthogonal basis in the mean function to model the residues that cannot be learned by the standard coupled approach. Therefore, our method extends, rather than replaces, the coupled approach to achieve strictly better performance. This construction admits a straightforward natural gradient update rule, so the structure of the information manifold that is lost during decoupling can be leveraged to speed up learning. Empirically, our algorithm demonstrates significantly faster convergence in multiple experiments.


Gaussian Process Prior Variational Autoencoders

arXiv.org Machine Learning

Variational autoencoders (VAE) are a powerful and widely-used class of models to learn complex data distributions in an unsupervised fashion. One important limitation of VAEs is the prior assumption that latent sample representations are independent and identically distributed. However, for many important datasets, such as time-series of images, this assumption is too strong: accounting for covariances between samples, such as those in time, can yield to a more appropriate model specification and improve performance in downstream tasks. In this work, we introduce a new model, the Gaussian Process (GP) Prior Variational Autoencoder (GPPVAE), to specifically address this issue. The GPPVAE aims to combine the power of VAEs with the ability to model correlations afforded by GP priors. To achieve efficient inference in this new class of models, we leverage structure in the covariance matrix, and introduce a new stochastic backpropagation strategy that allows for computing stochastic gradients in a distributed and low-memory fashion. We show that our method outperforms conditional VAEs (CVAEs) and an adaptation of standard VAEs in two image data applications.


Stein Variational Gradient Descent as Moment Matching

arXiv.org Machine Learning

Stein variational gradient descent (SVGD) is a non-parametric inference algorithm that evolves a set of particles to fit a given distribution of interest. We analyze the non-asymptotic properties of SVGD, showing that there exists a set of functions, which we call the Stein matching set, whose expectations are exactly estimated by any set of particles that satisfies the fixed point equation of SVGD. This set is the image of Stein operator applied on the feature maps of the positive definite kernel used in SVGD. Our results provide a theoretical framework for analyzing the properties of SVGD with different kernels, shedding insight into optimal kernel choice. In particular, we show that SVGD with linear kernels yields exact estimation of means and variances on Gaussian distributions, while random Fourier features enable probabilistic bounds for distributional approximation. Our results offer a refreshing view of the classical inference problem as fitting Stein's identity or solving the Stein equation, which may motivate more efficient algorithms.


Scalable Unbalanced Optimal Transport using Generative Adversarial Networks

arXiv.org Machine Learning

Generative adversarial networks (GANs) are an expressive class of neural generative models with tremendous success in modeling high-dimensional continuous measures. In this paper, we present a scalable method for unbalanced optimal transport (OT) based on the generative-adversarial framework. We formulate unbalanced OT as a problem of simultaneously learning a transport map and a scaling factor that push a source measure to a target measure in a cost-optimal manner. In addition, we propose an algorithm for solving this problem based on stochastic alternating gradient updates, similar in practice to GANs. We also provide theoretical justification for this formulation, showing that it is closely related to an existing static formulation by Liero et al. (2018), and perform numerical experiments demonstrating how this methodology can be applied to population modeling.


SpiderBoost: A Class of Faster Variance-reduced Algorithms for Nonconvex Optimization

arXiv.org Machine Learning

There has been extensive research on developing stochastic variance reduced methods to solve large-scale optimization problems. More recently, a novel algorithm of such a type named SPIDER has been developed in \cite{Fang2018}, which was shown to outperform existing algorithms of the same type and meet the lower bound in certain regimes. Though interesting in theory, SPIDER requires $\epsilon$-level stepsize to guarantee the convergence, and consequently runs slow in practice. This paper proposes SpiderBoost as an improved SPIDER scheme, which comes with two major advantages compared to SPIDER. First, it allows much larger stepsize without sacrificing the convergence rate, and hence runs substantially faster than SPIDER in practice. Second, it extends much more easily to proximal algorithms with guaranteed convergence for solving composite optimization problems, which appears challenging for SPIDER due to stringent requirement on per-iteration increment to guarantee its convergence. Both advantages can be attributed to the new convergence analysis we develop for SpiderBoost that allows much more flexibility for choosing algorithm parameters. As further generalization of SpiderBoost, we show that proximal SpiderBoost achieves a stochastic first-order oracle (SFO) complexity of $\mathcal{O}(\min\{n^{1/2}\epsilon^{-1},\epsilon^{-3/2}\})$ for composite optimization, which improves the existing best results by a factor of $\mathcal{O}(\min\{n^{1/6},\epsilon^{-1/6}\})$.


signProx: One-Bit Proximal Algorithm for Nonconvex Stochastic Optimization

arXiv.org Machine Learning

Stochastic gradient descent (SGD) is one of the most widely used optimization methods for parallel and distributed processing of large datasets. One of the key limitations of distributed SGD is the need to regularly communicate the gradients between different computation nodes. To reduce this communication bottleneck, recent work has considered a one-bit variant of SGD, where only the sign of each gradient element is used in optimization. In this paper, we extend this idea by proposing a stochastic variant of the proximal-gradient method that also uses one-bit per update element. We prove the theoretical convergence of the method for non-convex optimization under a set of explicit assumptions. Our results indicate that the compressed method can match the convergence rate of the uncompressed one, making the proposed method potentially appealing for distributed processing of large datasets.


Stochastic Gradient MCMC for State Space Models

arXiv.org Machine Learning

State space models (SSMs) are a flexible approach to modeling complex time series. However, inference in SSMs is often computationally prohibitive for long time series. Stochastic gradient MCMC (SGMCMC) is a popular method for scalable Bayesian inference for large independent data. Unfortunately when applied to dependent data, such as in SSMs, SGMCMC's stochastic gradient estimates are biased as they break crucial temporal dependencies. To alleviate this, we propose stochastic gradient estimators that control this bias by performing additional computation in a `buffer' to reduce breaking dependencies. Furthermore, we derive error bounds for this bias and show a geometric decay under mild conditions. Using these estimators, we develop novel SGMCMC samplers for discrete, continuous and mixed-type SSMs. Our experiments on real and synthetic data demonstrate the effectiveness of our SGMCMC algorithms compared to batch MCMC, allowing us to scale inference to long time series with millions of time points.