Gradient Descent
On Scalable Inference with Stochastic Gradient Descent
Fang, Yixin, Xu, Jinfeng, Yang, Lei
In many applications involving large datasets or online updating, stochastic gradient descent (SGD) provides a scalable way to compute parameter estimates and has gained increasing popularity due to its numerical convenience and memory efficiency. While the asymptotic properties of SGD-based estimators were established decades ago, statistical inference such as interval estimation remains largely unexplored. Traditional resampling methods such as the bootstrap are not computationally feasible since they require repeatedly drawing independent samples from the entire dataset. The plug-in method is not applicable when there is no explicit formula for the covariance matrix of the estimator. In this paper, we propose a scalable inferential procedure for stochastic gradient descent, which, upon the arrival of each observation, updates the SGD estimate as well as a large number of randomly perturbed SGD estimates. The proposed method is easy to implement in practice. We establish its theoretical properties for a general class of models that includes generalized linear models and quantile regression models as special cases. The finite-sample performance and numerical utility are evaluated by simulation studies and two real data applications.
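The idea of updating perturbed SGD estimates alongside the main estimate can be illustrated with a minimal sketch. It assumes a least-squares model, averaged SGD iterates, and random gradient weights drawn from an exponential distribution with mean 1 and variance 1; the function name `perturbed_sgd`, the step-size schedule, and the percentile interval are illustrative assumptions and are not taken from the abstract.

```python
import numpy as np

def perturbed_sgd(X, y, n_perturbed=200, lr0=0.2, decay=0.501, seed=0):
    """One pass of averaged SGD for least squares, plus perturbed copies.

    Hypothetical sketch: each perturbed copy sees the same observation but
    scales its gradient by an independent random weight (mean 1, variance 1).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)                    # main SGD iterate
    theta_b = np.zeros((n_perturbed, d))   # perturbed SGD iterates
    avg = np.zeros(d)                      # running average of main iterate
    avg_b = np.zeros((n_perturbed, d))     # running averages of the copies
    for t in range(n):
        x_t, y_t = X[t], y[t]
        lr = lr0 * (t + 1) ** (-decay)     # assumed decaying step size
        # Main update: gradient of 0.5 * (x'theta - y)^2.
        theta -= lr * (x_t @ theta - y_t) * x_t
        # Perturbed updates: same observation, randomly reweighted gradient.
        w = rng.exponential(1.0, size=n_perturbed)
        resid = theta_b @ x_t - y_t
        theta_b -= lr * (w * resid)[:, None] * x_t[None, :]
        # Online averaging, so no past observations need to be stored.
        avg += (theta - avg) / (t + 1)
        avg_b += (theta_b - avg_b) / (t + 1)
    return avg, avg_b

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(5000, 2))
    y = X @ np.array([1.0, -2.0]) + 0.5 * rng.normal(size=5000)
    est, est_b = perturbed_sgd(X, y)
    # One possible interval: percentiles of the perturbed estimates.
    lower, upper = np.percentile(est_b, [2.5, 97.5], axis=0)
    print(est, lower, upper)
```

Each observation is touched once and then discarded, so memory use depends on the number of perturbed copies rather than on the size of the dataset.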
Probabilistic Line Searches for Stochastic Optimization
Mahsereci, Maren, Hennig, Philipp
In deterministic optimization, line searches are a standard tool ensuring stability and efficiency. Where only stochastic gradients are available, no direct equivalent has so far been formulated, because uncertain gradients do not allow for a strict sequence of decisions collapsing the search space. We construct a probabilistic line search by combining the structure of existing deterministic methods with notions from Bayesian optimization. Our method retains a Gaussian process surrogate of the univariate optimization objective, and uses a probabilistic belief over the Wolfe conditions to monitor the descent. The algorithm has very low computational cost, and no user-controlled parameters. Experiments show that it effectively removes the need to define a learning rate for stochastic gradient descent.
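For reference, the Wolfe conditions mentioned above can be written down in their standard deterministic form. The sketch below only checks the two conditions for exact function values and slopes; it does not implement the Gaussian process surrogate or the probabilistic belief described in the abstract. The function name and the constants `c1` and `c2` are common textbook choices, assumed here for illustration.

```python
def wolfe_conditions(f0, df0, ft, dft, t, c1=1e-4, c2=0.9):
    """Check the weak Wolfe conditions for a step of size t.

    f0, df0: objective value and directional derivative at step size 0.
    ft, dft: objective value and directional derivative at step size t.
    Returns (sufficient_decrease, curvature) as two booleans.
    """
    sufficient_decrease = ft <= f0 + c1 * t * df0   # Armijo condition
    curvature = dft >= c2 * df0                     # slope has flattened enough
    return sufficient_decrease, curvature

if __name__ == "__main__":
    # Univariate objective along the search direction: phi(t) = (1 - t)^2.
    phi = lambda t: (1.0 - t) ** 2
    dphi = lambda t: -2.0 * (1.0 - t)
    for t in (0.01, 1.0, 2.0):
        print(t, wolfe_conditions(phi(0.0), dphi(0.0), phi(t), dphi(t), t))
```

With noisy gradients these comparisons can no longer be evaluated exactly, which is the gap the probabilistic formulation is meant to address.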
Coupling Adaptive Batch Sizes with Learning Rates
Balles, Lukas, Romero, Javier, Hennig, Philipp
Mini-batch stochastic gradient descent and variants thereof have become standard for large-scale empirical risk minimization such as the training of neural networks. These methods are usually used with a constant batch size chosen by simple empirical inspection. The batch size significantly influences the behavior of the stochastic optimization algorithm, though, since it determines the variance of the gradient estimates. This variance also changes over the optimization process; when using a constant batch size, stability and convergence are thus often enforced by means of a (manually tuned) decreasing learning rate schedule. We propose a practical method for dynamic batch size adaptation. It estimates the variance of the stochastic gradients and adapts the batch size to decrease the variance proportionally to the value of the objective function, removing the need for the aforementioned learning rate decrease. In contrast to recent related work, our algorithm couples the batch size to the learning rate, directly reflecting the known relationship between the two. On popular image classification benchmarks, our batch size adaptation yields faster optimization convergence, while simultaneously simplifying learning rate tuning. A TensorFlow implementation is available.
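One way to read the coupling described above is as a rule that sets the batch size proportional to the learning rate times the estimated gradient variance, divided by the current objective value. The sketch below encodes that reading only; the function name, the clamping bounds `m_min` and `m_max`, and the exact form of the rule are assumptions for illustration and may differ from the released implementation.

```python
import math

def coupled_batch_size(lr, grad_variance, loss, m_min=16, m_max=2048):
    """Hypothetical batch-size rule: m = lr * variance / loss, then clamped.

    lr:            current learning rate
    grad_variance: estimated total variance of a single-example gradient
    loss:          current (positive) value of the objective function
    """
    if loss <= 0:
        raise ValueError("loss must be positive for this rule")
    m = math.ceil(lr * grad_variance / loss)
    return max(m_min, min(m_max, m))

if __name__ == "__main__":
    # As the loss shrinks during training, the suggested batch size grows.
    for loss in (2.0, 1.0, 0.5):
        print(loss, coupled_batch_size(lr=0.5, grad_variance=1001.0, loss=loss))
```

Under this rule, a larger learning rate or a smaller loss both call for a larger batch, which plays the role that a decaying learning rate schedule would otherwise play.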
On Sampling Strategies for Neural Network-based Collaborative Filtering
Chen, Ting, Sun, Yizhou, Shi, Yue, Hong, Liangjie
Recent advances in neural networks have inspired people to design hybrid recommendation algorithms that can incorporate both (1) user-item interaction information and (2) content information including image, audio, and text. Despite their promising results, neural network-based recommendation algorithms incur extensive computational costs, making them challenging to scale and improve upon. In this paper, we propose a general neural network-based recommendation framework, which subsumes several existing state-of-the-art recommendation algorithms, and address the efficiency issue by investigating sampling strategies in the stochastic gradient descent training for the framework. We tackle this issue by first establishing a connection between the loss functions and the user-item interaction bipartite graph, where the loss function terms are defined on links while major computation burdens are located at nodes. We call this type of loss function a "graph-based" loss function, for which varied mini-batch sampling strategies can have different computational costs. Based on this insight, three novel sampling strategies are proposed, which can significantly improve the training efficiency of the proposed framework (up to a $30\times$ speedup in our experiments), as well as improve the recommendation performance. Theoretical analysis is also provided for both the computational cost and the convergence. We believe the study of sampling strategies has further implications for general graph-based loss functions, and would also enable more research under the neural network-based recommendation framework.
Scalable Multi-Class Gaussian Process Classification using Expectation Propagation
Villacampa-Calvo, Carlos, Hernández-Lobato, Daniel
This paper describes an expectation propagation (EP) method for multi-class classification with Gaussian processes that scales well to very large datasets. In such a method the estimate of the log-marginal-likelihood involves a sum across the data instances. This enables efficient training using stochastic gradients and mini-batches. When this type of training is used, the computational cost does not depend on the number of data instances $N$. Furthermore, extra assumptions in the approximate inference process make the memory cost independent of $N$. The consequence is that the proposed EP method can be used on datasets with millions of instances. We empirically compare this method with alternative approaches that approximate the required computations using variational inference. The results show that it performs similarly to or even better than these techniques, which sometimes give significantly worse predictive distributions in terms of the test log-likelihood. Besides this, the training process of the proposed approach also seems to converge in a smaller number of iterations.
Statistical Mechanics of Node-perturbation Learning with Noisy Baseline
Hara, Kazuyuki, Katahira, Kentaro, Okada, Masato
Node-perturbation learning is a type of statistical gradient descent algorithm that can be applied to problems where the objective function is not explicitly formulated, including reinforcement learning. It estimates the gradient of an objective function by using the change in the objective function in response to the perturbation. The value of the objective function for an unperturbed output is called a baseline. Cho et al. proposed node-perturbation learning with a noisy baseline. In this paper, we report on building the statistical mechanics of Cho's model and on deriving coupled differential equations of order parameters that depict learning dynamics. We also show how to derive the generalization error by solving the differential equations of order parameters. On the basis of the results, we show that Cho's results also apply in general cases and show some general performance properties of Cho's model.
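The gradient estimate described above can be sketched for a single linear output unit. The example assumes a squared-error objective, Gaussian perturbations of the output, and a noise-free baseline; it does not model the noisy baseline studied in the paper. The function name and all parameter values are illustrative assumptions.

```python
import numpy as np

def node_perturbation_gradient(w, x, target, sigma=0.1, n_samples=1000, rng=None):
    """Estimate the gradient of E = 0.5 * (w.x - target)^2 with respect to w.

    Only objective values are used: the output is perturbed by noise xi and
    the change in the objective relative to the baseline is correlated with xi.
    """
    rng = np.random.default_rng() if rng is None else rng
    y = w @ x
    baseline = 0.5 * (y - target) ** 2               # objective for unperturbed output
    xi = rng.normal(0.0, sigma, size=n_samples)      # output perturbations
    perturbed = 0.5 * (y + xi - target) ** 2         # objective for perturbed outputs
    dE_dy = np.mean((perturbed - baseline) * xi) / sigma**2
    return dE_dy * x                                  # chain rule through y = w.x

if __name__ == "__main__":
    w = np.array([0.5, -1.0])
    x = np.array([1.0, 2.0])
    estimate = node_perturbation_gradient(
        w, x, target=1.0, n_samples=200_000, rng=np.random.default_rng(0)
    )
    exact = (w @ x - 1.0) * x   # analytic gradient, for comparison only
    print(estimate, exact)
```

For Gaussian perturbations the estimate is unbiased for this objective, because the odd moments of the perturbation vanish; its variance is what the choice of baseline affects.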
A Divergence Bound for Hybrids of MCMC and Variational Inference and an Application to Langevin Dynamics and SGVI
Two popular classes of methods for approximate inference are Markov chain Monte Carlo (MCMC) and variational inference. MCMC tends to be accurate if run for a long enough time, while variational inference tends to give better approximations at shorter time horizons. However, the amount of time needed for MCMC to exceed the performance of variational methods can be quite high, motivating more fine-grained tradeoffs. This paper derives a distribution over variational parameters, designed to minimize a bound on the divergence between the resulting marginal distribution and the target, and gives an example of how to sample from this distribution in a way that interpolates between the behavior of existing methods based on Langevin dynamics and stochastic gradient variational inference (SGVI).
Stochastic modified equations and adaptive stochastic gradient algorithms
Li, Qianxiao, Tai, Cheng, E, Weinan
We develop the method of stochastic modified equations (SME), in which stochastic gradient algorithms are approximated in the weak sense by continuous-time stochastic differential equations. We exploit the continuous formulation together with optimal control theory to derive novel adaptive hyper-parameter adjustment policies. Our algorithms have competitive performance with the added benefit of being robust to varying models and datasets. This provides a general methodology for the analysis and design of stochastic gradient algorithms.
Gradient Descent for Spiking Neural Networks
Huh, Dongsung, Sejnowski, Terrence J.
Many studies of neural computation are based on network models of static neurons that produce analog output, despite the fact that information processing in the brain is predominantly carried out by dynamic neurons that produce discrete pulses called spikes. Research in spike-based computation has been impeded by the lack of an efficient supervised learning algorithm for spiking networks. Here, we present a gradient descent method for optimizing spiking network models by introducing a differentiable formulation of spiking networks and deriving the exact gradient calculation. For demonstration, we trained recurrent spiking networks on two dynamic tasks: one that requires optimizing fast (~millisecond) spike-based interactions for efficient encoding of information, and a delayed-memory XOR task over extended duration (~second). The results show that our method indeed optimizes the spiking network dynamics on the time scale of individual spikes as well as on behavioral time scales. In conclusion, our result offers a general-purpose supervised learning algorithm for spiking neural networks, thus advancing further investigation of spike-based computation.
Analysis of a Natural Gradient Algorithm on Monotonic Convex-Quadratic-Composite Functions
In this paper we investigate the convergence properties of a variant of the Covariance Matrix Adaptation Evolution Strategy (CMA-ES). Our study is based on the recent theoretical foundation that the pure rank-mu update CMA-ES performs the natural gradient descent on the parameter space of Gaussian distributions. We derive a novel variant of the natural gradient method where the parameters of the Gaussian distribution are updated along the natural gradient to improve a newly defined function on the parameter space. We study this algorithm on composites of a monotone function with a convex quadratic function. We prove that our algorithm adapts the covariance matrix so that it becomes proportional to the inverse of the Hessian of the original objective function. We also show the speed of covariance matrix adaptation and the speed of convergence of the parameters. We introduce a stochastic algorithm that approximates the natural gradient with finite samples and present some simulated results to evaluate how precisely the stochastic algorithm approximates the deterministic, ideal one under finite samples and to see how similarly our algorithm and the CMA-ES perform.