Gradient Descent
Can stochastic gradient descent (SGD) optimization be used with Multiple Adaptive Regression Spline (MARS)? • /r/MachineLearning
I would like to know if someone has strong arguments for, or against, the technical compatibility of stochastic gradient descent specifically for optimizing the learned parameters of the learning algorithm MARS. I am looking into it. My goal is to hear from others who already looked into it too. I would love to entertain tangential discussions in another thread.
Open sourcing QMF for matrix factorization
Since the objective is an expectation, we can use the Stochastic Gradient Descent (SGD) algorithm to minimize it, by iteratively minimizing the loss on sampled triplets of (user, positive item, negative item). Our implementation uses a form of parallel and asynchronous SGD updates (called Hogwild! [3]), which can give near-linear speedups in terms of the number of processors, especially when the data is sparse, so that concurrent updates are unlikely to be on the same parameters and overwrite each other. In the case of recommendation problems, the data is usually very sparse (unless, e.g., there are few items and most users have seen the same items), and this approach works very well.
Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks
Li, Chunyuan (Duke University) | Chen, Changyou (Duke University) | Carlson, David (Columbia University) | Carin, Lawrence (Duke University)
Effective training of deep neural networks suffers from two main issues. The first is that the parameter space of these models exhibit pathological curvature. Recent methods address this problem by using adaptive preconditioning for Stochastic Gradient Descent (SGD). These methods improve convergence by adapting to the local geometry of parameter space. A second issue is overfitting, which is typically addressed by early stopping. However, recent work has demonstrated that Bayesian model averaging mitigates this problem. The posterior can be sampled by using Stochastic Gradient Langevin Dynamics (SGLD). However, the rapidly changing curvature renders default SGLD methods inefficient. Here, we propose combining adaptive preconditioners with SGLD. In support of this idea, we give theoretical properties on asymptotic convergence and predictive risk. We also provide empirical results for Logistic Regression, Feedforward Neural Nets, and Convolutional Neural Nets, demonstrating that our preconditioned SGLD method gives state-of-the-art performance on these models.
Asynchronous Distributed Semi-Stochastic Gradient Optimization
Zhang, Ruiliang (Hong Kong University of Science and Technology) | Zheng, Shuai (Hong Kong University of Science and Technology) | Kwok, James T. (Hong Kong University of Science and Technology)
With the recent proliferation of large-scale learning problems, there have been a lot of interest on distributed machine learning algorithms, particularly those that are based on stochastic gradient descent (SGD) and its variants. However, existing algorithms either suffer from slow convergence due to the inherent variance of stochastic gradients, or have a fast linear convergence rate but at the expense of poorer solution quality. In this paper, we combine their merits by proposing a fast distributed asynchronous SGD-based algorithm with variance reduction. A constant learning rate can be used, and it is also guaranteed to converge linearly to the optimal solution. Experiments on the Google Cloud Computing Platform demonstrate that the proposed algorithm outperforms state-of-the-art distributed asynchronous algorithms in terms of both wall clock time and solution quality.
DinTucker: Scaling Up Gaussian Process Models on Large Multidimensional Arrays
Zhe, Shandian (Purdue University) | Qi, Yuan (Purdue University) | Park, Youngja ( IBM Thomas J. Watson Research Center ) | Xu, Zenglin (University of Electronic Science and Technology of China) | Molloy, Ian ( IBM Thomas J. Watson Research Center ) | Chari, Suresh ( IBM Thomas J. Watson Research Center )
Tensor decomposition methods are effective tools for modelling multidimensional array data (i.e., tensors). Among them, nonparametric Bayesian models, such as Infinite Tucker Decomposition (InfTucker), are more powerful than multilinear factorization approaches, including Tucker and PARAFAC, and usually achieve better predictive performance. However, they are difficult to handle massive data due to a prohibitively high training cost. To address this limitation, we propose Distributed infinite Tucker (DinTucker), a new hierarchical Bayesian model that enables local learning of InfTucker on subarrays and global information integration from local results. We further develop a distributed stochastic gradient descent algorithm, coupled with variational inference for model estimation. In addition, the connection between DinTucker and InfTucker is revealed in terms of model evidence. Experiments demonstrate that DinTucker maintains the predictive accuracy of InfTucker and is scalable on massive data: On multidimensional arrays with billions of elements from two real-world applications, DinTucker achieves significantly higher prediction accuracy with less training time, compared with the state-of-the-art large-scale tensor decomposition method, GigaTensor.
Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee
Zhao, Shen-Yi (Nanjing University) | Li, Wu-Jun (Nanjing University)
Stochastic gradient descent (SGD) and its variants have become more and more popular in machine learning due to their efficiency and effectiveness. To handle large-scale problems, researchers have recently proposed several parallel SGD methods for multicore systems. However, existing parallel SGD methods cannot achieve satisfactory performance in real applications. In this paper, we propose a fast asynchronous parallel SGD method, called AsySVRG, by designing an asynchronous strategy to parallelize the recently proposed SGD variant called stochastic variance reduced gradient (SVRG). AsySVRG adopts a lock-free strategy which is more efficient than other strategies with locks. Furthermore, we theoretically prove that AsySVRG is convergent with a linear convergence rate. Both theoretical and empirical results show that AsySVRG can outperform existing state-of-the-art parallel SGD methods like Hogwild! in terms of convergence rate and computation cost.
Stochastic Optimization for Kernel PCA
Zhang, Lijun (Nanjing University) | Yang, Tianbao (University of Iowa) | Yi, Jinfeng (IBM Thomas J. Watson Research Center) | Jin, Rong (Alibaba Group) | Zhou, Zhi-Hua (Nanjing University)
Kernel Principal Component Analysis (PCA) is a popular extension of PCA which is able to find nonlinear patterns from data. However, the application of kernel PCA to large-scale problems remains a big challenge, due to its quadratic space complexity and cubic time complexity in the number of examples. To address this limitation, we utilize techniques from stochastic optimization to solve kernel PCA with linear space and time complexities per iteration. Specifically, we formulate it as a stochastic composite optimization problem, where a nuclear norm regularizer is introduced to promote low-rankness, and then develop a simple algorithm based on stochastic proximal gradient descent. During the optimization process, the proposed algorithm always maintains a low-rank factorization of iterates that can be conveniently held in memory. Compared to previous iterative approaches, a remarkable property of our algorithm is that it is equipped with an explicit rate of convergence. Theoretical analysis shows that the solution of our algorithm converges to the optimal one at an O(1/T) rate, where T is the number of iterations.
Scalable Completion of Nonnegative Matrix with Separable Structure
Yu, Xiyu (University of Technology, Sydney) | Bian, Wei (University of Technology, Sydney) | Tao, Dacheng ( University of Technology, Sydney )
Matrix completion is to recover missing/unobserved values of a data matrix from very limited observations. Due to widely potential applications, it has received growing interests in fields from machine learning, data mining, to collaborative filtering and computer vision. To ensure the successful recovery of missing values, most existing matrix completion algorithms utilise the low-rank assumption, i.e., the fully observed data matrix has a low rank, or equivalently the columns of the matrix can be linearly represented by a few numbers of basis vectors. Although such low-rank assumption applies generally in practice, real-world data can process much richer structural information. In this paper, we present a new model for matrix completion, motivated by the separability assumption of nonnegative matrices from the recent literature of matrix factorisations: there exists a set of columns of the matrix such that the resting columns can be represented by their convex combinations. Given the separability property, which holds reasonably for many applications, our model provides a more accurate matrix completion than the low-rank based algorithms. Further, we derives a scalable algorithm to solve our matrix completion model, which utilises a randomised method to select the basis columns under the separability assumption and a coordinate gradient based method to automatically deal with the structural constraints in optimisation. Compared to the state-of-the-art algorithms, the proposed matrix completion model achieves competitive results on both synthetic and real datasets.
Expected Tensor Decomposition with Stochastic Gradient Descent
Maehara, Takanori (Shizuoka University) | Hayashi, Kohei (National Institute of Informatics) | Kawarabayashi, Ken-ichi (National Institute of Informatics)
In this study, we investigate expected CP decomposition — a special case of CP decomposition in which a tensor to be decomposed is given as the sum or average of tensor samples X ( t ) for t = 1,..., T . To determine this decomposition, we develope stochastic-gradient-descent-type algorithms with four appealing features: efficient memory use, ability to work in an online setting, robustness of parameter tuning, and simplicity. Our theoretical analysis show that the solutions do not diverge to infinity for any initial value or step size. Experimental results confirm that our algorithms significantly outperform all existing methods in terms of accuracy. We also show that they can successfully decompose a large tensor, containing billion-scale nonzero elements.