Gradient Descent
Asynchronous Decentralized Parallel Stochastic Gradient Descent
Lian, Xiangru, Zhang, Wei, Zhang, Ce, Liu, Ji
Most commonly used distributed machine learning systems are either synchronous or centralized asynchronous. Synchronous algorithms like AllReduce-SGD perform poorly in a heterogeneous environment, while asynchronous algorithms using a parameter server suffer from 1) communication bottleneck at parameter servers when workers are many, and 2) significantly worse convergence when the traffic to parameter server is congested. Can we design an algorithm that is robust in a heterogeneous environment, while being communication efficient and maintaining the best-possible convergence rate? In this paper, we propose an asynchronous decentralized stochastic gradient decent algorithm (AD-PSGD) satisfying all above expectations. Our theoretical analysis shows AD-PSGD converges at the optimal $O(1/\sqrt{K})$ rate as SGD and has linear speedup w.r.t. number of workers. Empirically, AD-PSGD outperforms the best of decentralized parallel SGD (D-PSGD), asynchronous parallel SGD (A-PSGD), and standard data parallel SGD (AllReduce-SGD), often by orders of magnitude in a heterogeneous environment. When training ResNet-50 on ImageNet with up to 128 GPUs, AD-PSGD converges (w.r.t epochs) similarly to the AllReduce-SGD, but each epoch can be up to 4-8X faster than its synchronous counterparts in a network-sharing HPC environment. To the best of our knowledge, AD-PSGD is the first asynchronous algorithm that achieves a similar epoch-wise convergence rate as AllReduce-SGD, at an over 100-GPU scale.
Training and Evaluating Improved Dependency-Based Word Embeddings
Li, Chen (BeiHang University) | Li, Jianxin (BeiHang University) | Song, Yangqiu (Hong Kong University of Science and Technology) | Lin, Ziwei (BeiHang University)
Word embedding has been widely used in many natural language processing tasks. In this paper, we focus on learning word embeddings through selective higher-order relationships in sentences to improve the embeddings to be less sensitive to local context and more accurate in capturing semantic compositionality. We present a novel multi-order dependency-based strategy to composite and represent the context under several essential constraints. In order to realize selective learning from the word contexts, we automatically assign the strengths of different dependencies between co-occurred words in the stochastic gradient descent process. We evaluate and analyze our proposed approach using several direct and indirect tasks for word embeddings. Experimental results demonstrate that our embeddings are competitive to or better than state-of-the-art methods and significantly outperform other methods in terms of context stability. The output weights and representations of dependencies obtained in our embedding model conform to most of the linguistic characteristics and are valuable for many downstream tasks.
Stochastic Non-Convex Ordinal Embedding With Stabilized Barzilai-Borwein Step Size
Ma, Ke (Institute of Information Engineering, Chinese Academy of Sciences) | Zeng, Jinshan (School of Cyber Security, University of Chinese Academy of Sciences) | Xiong, Jiechao (School of Computer Information Engineering, Jiangxi Normal University) | Xu, Qianqian (Hong Kong University of Science and Technology) | Cao, Xiaochun (Tencent AI Lab) | Liu, Wei (Institute of Information Engineering, Chinese Academy of Sciences) | Yao, Yuan (Institute of Information Engineering, Chinese Academy of Sciences)
Learning representation from relative similarity comparisons, often called ordinal embedding, gains rising attention in recent years. Most of the existing methods are batch methods designed mainly based on the convex optimization, say, the projected gradient descent method. However, they are generally time-consuming due to that the singular value decomposition (SVD) is commonly adopted during the update, especially when the data size is very large. To overcome this challenge, we propose a stochastic algorithm called SVRG-SBB, which has the following features: (a) SVD-free via dropping convexity, with good scalability by the use of stochastic algorithm, i.e., stochastic variance reduced gradient (SVRG), and (b) adaptive step size choice via introducing a new stabilized Barzilai-Borwein (SBB) method as the original version for convex problems might fail for the considered stochastic non-convex optimization problem. Moreover, we show that the proposed algorithm converges to a stationary point at a rate O (1/ T ) in our setting, where T is the number of total iterations. Numerous simulations and real-world data experiments are conducted to show the effectiveness of the proposed algorithm via comparing with the state-of-the-art methods, particularly, much lower computational cost with good prediction performance.
Riemannian Stein Variational Gradient Descent for Bayesian Inference
Liu, Chang (Tsinghua University) | Zhu, Jun (Tsinghua University)
We develop Riemannian Stein Variational Gradient Descent (RSVGD), a Bayesian inference method that generalizes Stein Variational Gradient Descent (SVGD) to Riemann manifold. The benefits are two-folds: (i) for inference tasks in Euclidean spaces, RSVGD has the advantage over SVGD of utilizing information geometry, and (ii) for inference tasks on Riemann manifolds, RSVGD brings the unique advantages of SVGD to the Riemannian world. To appropriately transfer to Riemann manifolds, we conceive novel and non-trivial techniques for RSVGD, which are required by the intrinsically different characteristics of general Riemann manifolds from Euclidean spaces. We also discover Riemannian Stein's Identity and Riemannian Kernelized Stein Discrepancy. Experimental results show the advantages over SVGD of exploring distribution geometry and the advantages of particle-efficiency, iteration-effectiveness and approximation flexibility over other inference methods on Riemann manifolds.
Statistical Inference Using SGD
Li, Tianyang (University of Texas at Austin) | Liu, Liu (University of Texas at Austin) | Kyrillidis, Anastasios ( IBM T.J. Watson Research Center, Yorktown Heights ) | Caramanis, Constantine (University of Texas at Austin)
We present a novel method for frequentist statistical inference in M-estimation problems, based on stochastic gradient descent (SGD) with a fixed step size: we demonstrate that the average of such SGD sequences can be used for statistical inference, after proper scaling. An intuitive analysis using the Ornstein-Uhlenbeck process suggests that such averages are asymptotically normal. To show the merits of our scheme, we apply it to both synthetic and real data sets, and demonstrate that its accuracy is comparable to classical statistical methods, while requiring potentially far less computation.
Deep Reinforcement Learning That Matters
Henderson, Peter (McGill University) | Islam, Riashat (McGill University) | Bachman, Philip (Microsoft) | Pineau, Joelle (McGill University) | Precup, Doina (McGill University) | Meger, David (McGill University)
In recent years, significant progress has been made in solving challenging problems across various domains using deep reinforcement learning (RL). Reproducing existing work and accurately judging the improvements offered by novel methods is vital to sustaining this progress. Unfortunately, reproducing results for state-of-the-art deep RL methods is seldom straightforward. In particular, non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results tough to interpret. Without significance metrics and tighter standardization of experimental reporting, it is difficult to determine whether improvements over the prior state-of-the-art are meaningful. In this paper, we investigate challenges posed by reproducibility, proper experimental techniques, and reporting procedures. We illustrate the variability in reported metrics and results when comparing against common baselines and suggest guidelines to make future results in deep RL more reproducible. We aim to spur discussion about how to ensure continued progress in the field by minimizing wasted effort stemming from results that are non-reproducible and easily misinterpreted.
Mini-Batch Stochastic ADMMs for Nonconvex Nonsmooth Optimization
In the paper, we study the mini-batch stochastic ADMMs (alternating direction method of multipliers) for the nonconvex nonsmooth optimization. We prove that, given an appropriate mini-batch size, the mini-batch stochastic ADMM without variance reduction (VR) technique is convergent and reaches the convergence rate of $O(1/T)$ to obtain a stationary point of the nonconvex optimization, where $T$ denotes the number of iterations. Moreover, we extend the mini-batch stochastic gradient method to both the nonconvex SVRG-ADMM and SAGA-ADMM in our initial paper \citep{huang2016stochastic}, and also prove that these mini-batch stochastic ADMMs reach the convergence rate of $O(1/T)$ without the condition on the mini-batch size. In particular, we provide a specific parameter selection for step size $\eta$ of stochastic gradients and penalization parameter $\rho$ of the augmented Lagrangian function. Finally, some experimental results demonstrate the effectiveness of our algorithms.
Learning One Convolutional Layer with Overlapping Patches
Goel, Surbhi, Klivans, Adam, Meka, Raghu
We give the first provably efficient algorithm for learning a one hidden layer convolutional network with respect to a general class of (potentially overlapping) patches. Additionally, our algorithm requires only mild conditions on the underlying distribution. We prove that our framework captures commonly used schemes from computer vision, including one-dimensional and two-dimensional "patch and stride" convolutions. Our algorithm-- $Convotron$ -- is inspired by recent work applying isotonic regression to learning neural networks. Convotron uses a simple, iterative update rule that is stochastic in nature and tolerant to noise (requires only that the conditional mean function is a one layer convolutional network, as opposed to the realizable setting). In contrast to gradient descent, Convotron requires no special initialization or learning-rate tuning to converge to the global optimum. We also point out that learning one hidden convolutional layer with respect to a Gaussian distribution and just $one$ disjoint patch $P$ (the other patches may be arbitrary) is $easy$ in the following sense: Convotron can efficiently recover the hidden weight vector by updating $only$ in the direction of $P$.
The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning
Ma, Siyuan, Bassily, Raef, Belkin, Mikhail
Stochastic Gradient Descent (SGD) with small mini-batch is a key component in modern large-scale learning. However, its efficiency has not been easy to analyze as most theoretical results require adaptive rates and show convergence rates far slower than that for gradient descent, making computational comparisons difficult. In this paper we aim to clarify the issue of fast SGD convergence. The key observation is that most modern architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss close to zero. While it is still unclear why these interpolated solutions perform well on test data, we show that these regimes allow for very fast convergence of SGD, comparable in the number of iterations to full gradient descent. Specifically, consider the setting with a quadratic objective function, or a general objective function in the proximity of a minimum, where the quadratic term is dominant. First, we obtain the optimal convergence rate for any mini-batch SGD and derive the optimal step size as a function of the batch size $m$. Second, we show: (1) $m=1$ is optimal in terms of number of computations required to achieve any desired accuracy. (2) There is a critical mini-batch size $m^*$ such that: (2a) SGD iteration with batch size $m\leq m^*$ is nearly equivalent to $m$ iterations of batch size $1$. (2b) SGD iteration with mini-batch $m> m^*$ is nearly equivalent to a full gradient descent iteration. The critical mini-batch size can be viewed as the limit for effective mini-batch parallelization. It is also nearly independent of the data size, implying $O(n)$ acceleration over GD per unit of computation. These theoretical analyses are verified by experimental results. Finally, we show how the interpolation perspective and our results fit in the deep neural networks and discuss connections to adaptive rates for SGD and variance reduction.
Towards Shockingly Easy Structured Classification: A Search-based Probabilistic Online Learning Framework
There are two major approaches for structured classification. One is the probabilistic gradient-based methods such as conditional random fields (CRF), which has high accuracy but with drawbacks: slow training, and no support of search-based optimization (which is important in many cases). The other one is the search-based learning methods such as perceptrons and margin infused relaxed algorithm (MIRA), which have fast training but also with drawbacks: low accuracy, no probabilistic information, and non-convergence in real-world tasks. We propose a novel and "shockingly easy" solution, a search-based probabilistic online learning method, to address most of those issues. This method searches the output candidates, derives probabilities, and conduct efficient online learning. We show that this method is with fast training, support search-based optimization, very easy to implement, with top accuracy, with probabilities, and with theoretical guarantees of convergence. Experiments on well-known tasks show that our method has better accuracy than CRF and almost as fast training speed as perceptron and MIRA. Results also show that SAPO can easily beat the state-of-the-art systems on those highly-competitive tasks, achieving record-breaking accuracies. The codes can be found at https://github.com/lancopku