Gradient Descent
Distributed Word2Vec using Graph Analytics Frameworks
Gill, Gurbinder, Dathathri, Roshan, Maleki, Saeed, Musuvathi, Madan, Mytkowicz, Todd, Saarikivi, Olli
Word embeddings capture semantic and syntactic similarities of words, represented as vectors. Word2Vec is a popular implementation of word embeddings; it takes as input a large corpus of text and learns a model that maps unique words in that corpus to other contextually relevant words. After training, Word2Vec's internal vector representation of words in the corpus map unique words to a vector space, which are then used in many downstream tasks. Training these models requires significant computational resources (training time often measured in days) and is difficult to parallelize. Most word embedding training uses stochastic gradient descent (SGD), an "inherently" sequential algorithm where at each step, the processing of the current example depends on the parameters learned from the previous examples. Prior approaches to parallelizing SGD do not honor these dependencies and thus potentially suffer poor convergence. This paper introduces GraphWord2Vec, a distributedWord2Vec algorithm which formulates the Word2Vec training process as a distributed graph problem and thus leverage state-of-the-art distributed graph analytics frameworks such as D-Galois and Gemini that scale to large distributed clusters. GraphWord2Vec also demonstrates how to use model combiners to honor data dependencies in SGD and thus scale without giving up convergence. We will show that GraphWord2Vec has linear scalability up to 32 machines converging as fast as a sequential run in terms of epochs, thus reducing training time by 14x.
Towards Understanding the Importance of Noise in Training Neural Networks
Zhou, Mo, Liu, Tianyi, Li, Yan, Lin, Dachao, Zhou, Enlu, Zhao, Tuo
Numerous empirical evidence has corroborated that the noise plays a crucial rule in effective and efficient training of neural networks. The theory behind, however, is still largely unknown. This paper studies this fundamental problem through training a simple two-layer convolutional neural network model. Although training such a network requires solving a nonconvex optimization problem with a spurious local optimum and a global optimum, we prove that perturbed gradient descent and perturbed mini-batch stochastic gradient algorithms in conjunction with noise annealing is guaranteed to converge to a global optimum in polynomial time with arbitrary initialization. This implies that the noise enables the algorithm to efficiently escape from the spurious local optimum. Numerical experiments are provided to support our theory.
Decentralized Stochastic Gradient Tracking for Empirical Risk Minimization
Recent works have shown superiorities of decentralized SGD to centralized counterparts in large-scale machine learning, but their theoretical gap is still not fully understood. In this paper, we propose a decentralized stochastic gradient tracking (DSGT) algorithm over peer-to-peer networks for empirical risk minimization problems, and explicitly evaluate its convergence rate in terms of key parameters of the problem, e.g., algebraic connectivity of the communication network, mini-batch size, and gradient variance. Importantly, it is the first theoretical result that can \emph{exactly} recover the rate of the centralized SGD, and has optimal dependence on the algebraic connectivity of the networks when using stochastic gradients. Moreover, we explicitly quantify how the network affects speedup and the rate improvement over existing works. Interestingly, we also point out for the first time that both linear and sublinear speedup can be possible. We empirically validate DSGT on neural networks and logistic regression problems, and show its advantage over the state-of-the-art algorithms.
Port-Hamiltonian Approach to Neural Network Training
Massaroli, Stefano, Poli, Michael, Califano, Federico, Faragasso, Angela, Park, Jinkyoo, Yamashita, Atsushi, Asama, Hajime
Neural networks are discrete entities: subdivided into discrete layers and parametrized by weights which are iteratively optimized via difference equations. Recent work proposes networks with layer outputs which are no longer quantized but are solutions of an ordinary differential equation (ODE); however, these networks are still optimized via discrete methods (e.g. gradient descent). In this paper, we explore a different direction: namely, we propose a novel framework for learning in which the parameters themselves are solutions of ODEs. By viewing the optimization process as the evolution of a port-Hamiltonian system, we can ensure convergence to a minimum of the objective function. Numerical experiments have been performed to show the validity and effectiveness of the proposed methods.
A Nonconvex Approach for Exact and Efficient Multichannel Sparse Blind Deconvolution
Qu, Qing, Li, Xiao, Zhu, Zhihui
We study the multi-channel sparse blind deconvolution (MCS-BD) problem, whose task is to simultaneously recover a kernel $\mathbf a$ and multiple sparse inputs $\{\mathbf x_i\}_{i=1}^p$ from their circulant convolution $\mathbf y_i = \mathbf a \circledast \mathbf x_i $ ($i=1,\cdots,p$). We formulate the task as a nonconvex optimization problem over the sphere. Under mild statistical assumptions of the data, we prove that the vanilla Riemannian gradient descent (RGD) method, with random initializations, provably recovers both the kernel $\mathbf a$ and the signals $\{\mathbf x_i\}_{i=1}^p$ up to a signed shift ambiguity. In comparison with state-of-the-art results, our work shows significant improvements in terms of sample complexity and computational efficiency. Our theoretical results are corroborated by numerical experiments, which demonstrate superior performance of the proposed approach over the previous methods on both synthetic and real datasets.
Gradient Descent based Weight Learning for Grouping Problems: Application on Graph Coloring and Equitable Graph Coloring
Goudet, Olivier, Duval, Bรฉatrice, Hao, Jin-Kao
A grouping problem involves partitioning a set of items into mutually disjoint groups or clusters according to some guiding decision criteria and imperative constraints. Grouping problems have many relevant applications and are computationally difficult. In this work, we present a general weight learning based optimization framework for solving grouping problems. The central idea of our approach is to formulate the task of seeking a solution as a real-valued weight matrix learning problem that is solved by first order gradient descent. A practical implementation of this framework is proposed with tensor calculus in order to benefit from parallel computing on GPU devices. To show its potential for tackling difficult problems, we apply the approach to two typical and well-known grouping problems (graph coloring and equitable graph coloring). We present large computational experiments and comparisons on popular benchmarks and report improved best-known results (new upper bounds) for several large graphs.
Partitioned integrators for thermodynamic parameterization of neural networks
Stochastic Gradient Langevin Dynamics, the "unadjusted Langevin algorithm", and Adaptive Langevin Dynamics (also known as Stochastic Gradient Nosรฉ-Hoover dynamics) are examples of existing thermodynamic parameterization methods in use for machine learning, but these can be substantially improved. We find that by partitioning the parameters based on natural layer structure we obtain schemes with rapid convergence for data sets with complicated loss landscapes. We describe easy-to-implement hybrid partitioned numerical algorithms, based on discretized stochastic differential equations, which are adapted to feed-forward neural networks, including LaLa (a multi-layer Langevin algorithm), AdLaLa (combining the adaptive Langevin and Langevin algorithms) and LOL (combining Langevin and Overdamped Langevin); we examine the convergence of these methods using numerical studies and compare their performance among themselves and in relation to standard alternatives such as stochastic gradient descent and ADAM. We present evidence that thermodynamic parameterization methods can be (i) faster, (ii) more accurate, and (iii) more robust than standard algorithms incorporated into machine learning frameworks, in particular for data sets with complicated loss landscapes. Moreover, we show in numerical studies that methods based on sampling excite many degrees of freedom.
Efficiency of Coordinate Descent Methods For Structured Nonconvex Optimization
Novel coordinate descent (CD) methods are proposed for minimizing nonconvex functions consisting of three terms: (i) a continuously differentiable term, (ii) a simple convex term, and (iii) a concave and continuous term. First, by extending randomized CD to nonsmooth nonconvex settings, we develop a coordinate subgradient method that randomly updates block-coordinate variables by using block composite subgradient mapping. This method converges asymptotically to critical points with proven sublinear convergence rate for certain optimality measures. Second, we develop a randomly permuted CD method with two alternating steps: linearizing the concave part and cycling through variables. We prove asymptotic convergence to critical points and sublinear complexity rate for objectives with both smooth and concave parts. Third, we extend accelerated coordinate descent (ACD) to nonsmooth and nonconvex optimization to develop a novel randomized proximal DC algorithm whereby we solve the subproblem inexactly by ACD. Convergence is guaranteed with at most a few number of ACD iterations for each DC subproblem, and convergence complexity is established for identification of some approximate critical points. Fourth, we further develop the third method to minimize certain ill-conditioned nonconvex functions: weakly convex functions with high Lipschitz constant to negative curvature ratios. We show that, under specific criteria, the ACD-based randomized method has superior complexity compared to conventional gradient methods. Finally, an empirical study on sparsity-inducing learning models demonstrates that CD methods are superior to gradient-based methods for certain large-scale problems.
Simple and optimal high-probability bounds for strongly-convex stochastic gradient descent
Harvey, Nicholas J. A., Liaw, Christopher, Randhawa, Sikander
We consider stochastic gradient descent algorithms for minimizing a non-smooth, strongly-convex function. Several forms of this algorithm, including suffix averaging, are known to achieve the optimal $O(1/T)$ convergence rate in expectation. We consider a simple, non-uniform averaging strategy of Lacoste-Julien et al. (2011) and prove that it achieves the optimal $O(1/T)$ convergence rate with high probability. Our proof uses a recently developed generalization of Freedman's inequality. Finally, we compare several of these algorithms experimentally and show that this non-uniform averaging strategy outperforms many standard techniques, and with smaller variance.
When BERT meets Pytorch
We keep the BERT encoder unfrozen so that all weights are updated with every iteration. Given the number of trainable parameters it's useful to train the model on multiple GPUs in parallel. I used 4 Tesla K80's for about 4500 training samples. Just remember that to access any model attribute, you can access it using modelName.module.attribute I used Stochastic Gradient Descent with momentum as the optimizer and found that cycling both the learning rates and momentum really helped to get the training and validation losses down.