AITopics

@machinelearnbotApr-25-2016, 05:34:34 GMT

Why is the unit of gradient descent in y instead of in x? • /r/MachineLearning

Why is the unit of gradient descent in y instead of in x? (self.MachineLearning) The direction of the gradient make sense, but the value seems to have the wrong unit.

artificial intelligence, gradient descent, machine learning, (2 more...)

@machinelearnbot

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.86)

#artificialintelligenceApr-19-2016, 19:42:18 GMT

Open sourcing QMF for matrix factorization

Since the objective is an expectation, we can use the Stochastic Gradient Descent (SGD) algorithm to minimize it, by iteratively minimizing the loss on sampled triplets of (user, positive item, negative item). Our implementation uses a form of parallel and asynchronous SGD updates (called Hogwild! [3]), which can give near-linear speedups in terms of the number of processors, especially when the data is sparse, so that concurrent updates are unlikely to be on the same parameters and overwrite each other. In the case of recommendation problems, the data is usually very sparse (unless, e.g., there are few items and most users have seen the same items), and this approach works very well.

artificial intelligence, machine learning, matrix factorization, (2 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.71)

Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks

Li, Chunyuan (Duke University) | Chen, Changyou (Duke University) | Carlson, David (Columbia University) | Carin, Lawrence (Duke University)

deep neural network, preconditioned stochastic gradient langevin dynamic, stochastic gradient langevin dynamic, (2 more...)

Effective training of deep neural networks suffers from two main issues. The first is that the parameter space of these models exhibit pathological curvature. Recent methods address this problem by using adaptive preconditioning for Stochastic Gradient Descent (SGD). These methods improve convergence by adapting to the local geometry of parameter space. A second issue is overfitting, which is typically addressed by early stopping. However, recent work has demonstrated that Bayesian model averaging mitigates this problem. The posterior can be sampled by using Stochastic Gradient Langevin Dynamics (SGLD). However, the rapidly changing curvature renders default SGLD methods inefficient. Here, we propose combining adaptive preconditioners with SGLD. In support of this idea, we give theoretical properties on asymptotic convergence and predictive risk. We also provide empirical results for Logistic Regression, Feedforward Neural Nets, and Convolutional Neural Nets, demonstrating that our preconditioned SGLD method gives state-of-the-art performance on these models.

Thirtieth AAAI Conference on Artificial Intelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Zhang, Ruiliang (Hong Kong University of Science and Technology) | Zheng, Shuai (Hong Kong University of Science and Technology) | Kwok, James T. (Hong Kong University of Science and Technology)

Asynchronous Distributed Semi-Stochastic Gradient Optimization

With the recent proliferation of large-scale learning problems, there have been a lot of interest on distributed machine learning algorithms, particularly those that are based on stochastic gradient descent (SGD) and its variants. However, existing algorithms either suffer from slow convergence due to the inherent variance of stochastic gradients, or have a fast linear convergence rate but at the expense of poorer solution quality. In this paper, we combine their merits by proposing a fast distributed asynchronous SGD-based algorithm with variance reduction. A constant learning rate can be used, and it is also guaranteed to converge linearly to the optimal solution. Experiments on the Google Cloud Computing Platform demonstrate that the proposed algorithm outperforms state-of-the-art distributed asynchronous algorithms in terms of both wall clock time and solution quality.

algorithm, gradient, server, (15 more...)

Thirtieth AAAI Conference on Artificial Intelligence

Country:

Asia > China > Hong Kong (0.04)
Asia > Middle East > Jordan (0.04)

Industry:

Education (0.66)
Information Technology > Services (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

DinTucker: Scaling Up Gaussian Process Models on Large Multidimensional Arrays

Zhe, Shandian (Purdue University) | Qi, Yuan (Purdue University) | Park, Youngja ( IBM Thomas J. Watson Research Center ) | Xu, Zenglin (University of Electronic Science and Technology of China) | Molloy, Ian ( IBM Thomas J. Watson Research Center ) | Chari, Suresh ( IBM Thomas J. Watson Research Center )

Tensor decomposition methods are effective tools for modelling multidimensional array data (i.e., tensors). Among them, nonparametric Bayesian models, such as Infinite Tucker Decomposition (InfTucker), are more powerful than multilinear factorization approaches, including Tucker and PARAFAC, and usually achieve better predictive performance. However, they are difficult to handle massive data due to a prohibitively high training cost. To address this limitation, we propose Distributed infinite Tucker (DinTucker), a new hierarchical Bayesian model that enables local learning of InfTucker on subarrays and global information integration from local results. We further develop a distributed stochastic gradient descent algorithm, coupled with variational inference for model estimation. In addition, the connection between DinTucker and InfTucker is revealed in terms of model evidence. Experiments demonstrate that DinTucker maintains the predictive accuracy of InfTucker and is scalable on massive data: On multidimensional arrays with billions of elements from two real-world applications, DinTucker achieves significantly higher prediction accuracy with less training time, compared with the state-of-the-art large-scale tensor decomposition method, GigaTensor.

artificial intelligence, inftucker, machine learning, (17 more...)

Thirtieth AAAI Conference on Artificial Intelligence

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Zhao, Shen-Yi (Nanjing University) | Li, Wu-Jun (Nanjing University)

Stochastic gradient descent (SGD) and its variants have become more and more popular in machine learning due to their efficiency and effectiveness. To handle large-scale problems, researchers have recently proposed several parallel SGD methods for multicore systems. However, existing parallel SGD methods cannot achieve satisfactory performance in real applications. In this paper, we propose a fast asynchronous parallel SGD method, called AsySVRG, by designing an asynchronous strategy to parallelize the recently proposed SGD variant called stochastic variance reduced gradient (SVRG). AsySVRG adopts a lock-free strategy which is more efficient than other strategies with locks. Furthermore, we theoretically prove that AsySVRG is convergent with a linear convergence rate. Both theoretical and empirical results show that AsySVRG can outperform existing state-of-the-art parallel SGD methods like Hogwild! in terms of convergence rate and computation cost.

artificial intelligence, asysvrg, machine learning, (13 more...)

Thirtieth AAAI Conference on Artificial Intelligence

Country: Asia (0.28)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Stochastic Optimization for Kernel PCA

Zhang, Lijun (Nanjing University) | Yang, Tianbao (University of Iowa) | Yi, Jinfeng (IBM Thomas J. Watson Research Center) | Jin, Rong (Alibaba Group) | Zhou, Zhi-Hua (Nanjing University)

Kernel Principal Component Analysis (PCA) is a popular extension of PCA which is able to find nonlinear patterns from data. However, the application of kernel PCA to large-scale problems remains a big challenge, due to its quadratic space complexity and cubic time complexity in the number of examples. To address this limitation, we utilize techniques from stochastic optimization to solve kernel PCA with linear space and time complexities per iteration. Specifically, we formulate it as a stochastic composite optimization problem, where a nuclear norm regularizer is introduced to promote low-rankness, and then develop a simple algorithm based on stochastic proximal gradient descent. During the optimization process, the proposed algorithm always maintains a low-rank factorization of iterates that can be conveniently held in memory. Compared to previous iterative approaches, a remarkable property of our algorithm is that it is equipped with an explicit rate of convergence. Theoretical analysis shows that the solution of our algorithm converges to the optimal one at an O(1/T) rate, where T is the number of iterations.

algorithm, artificial intelligence, machine learning, (17 more...)

Thirtieth AAAI Conference on Artificial Intelligence

Country: North America > United States (0.28)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

Scalable Completion of Nonnegative Matrix with Separable Structure

Yu, Xiyu (University of Technology, Sydney) | Bian, Wei (University of Technology, Sydney) | Tao, Dacheng ( University of Technology, Sydney )

Matrix completion is to recover missing/unobserved values of a data matrix from very limited observations. Due to widely potential applications, it has received growing interests in fields from machine learning, data mining, to collaborative filtering and computer vision. To ensure the successful recovery of missing values, most existing matrix completion algorithms utilise the low-rank assumption, i.e., the fully observed data matrix has a low rank, or equivalently the columns of the matrix can be linearly represented by a few numbers of basis vectors. Although such low-rank assumption applies generally in practice, real-world data can process much richer structural information. In this paper, we present a new model for matrix completion, motivated by the separability assumption of nonnegative matrices from the recent literature of matrix factorisations: there exists a set of columns of the matrix such that the resting columns can be represented by their convex combinations. Given the separability property, which holds reasonably for many applications, our model provides a more accurate matrix completion than the low-rank based algorithms. Further, we derives a scalable algorithm to solve our matrix completion model, which utilises a randomised method to select the basis columns under the separability assumption and a coordinate gradient based method to automatically deal with the structural constraints in optimisation. Compared to the state-of-the-art algorithms, the proposed matrix completion model achieves competitive results on both synthetic and real datasets.

artificial intelligence, machine learning, matrix completion, (16 more...)

Thirtieth AAAI Conference on Artificial Intelligence

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.49)

Maehara, Takanori (Shizuoka University) | Hayashi, Kohei (National Institute of Informatics) | Kawarabayashi, Ken-ichi (National Institute of Informatics)

Expected Tensor Decomposition with Stochastic Gradient Descent

In this study, we investigate expected CP decomposition — a special case of CP decomposition in which a tensor to be decomposed is given as the sum or average of tensor samples X ( t ) for t = 1,..., T . To determine this decomposition, we develope stochastic-gradient-descent-type algorithms with four appealing features: efficient memory use, ability to work in an online setting, robustness of parameter tuning, and simplicity. Our theoretical analysis show that the solutions do not diverge to infinity for any initial value or step size. Experimental results confirm that our algorithms significantly outperform all existing methods in terms of accuracy. We also show that they can successfully decompose a large tensor, containing billion-scale nonzero elements.

artificial intelligence, decomposition, machine learning, (17 more...)

Thirtieth AAAI Conference on Artificial Intelligence

Country: Asia > Japan > Honshū (0.28)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)