AITopics

1605.069

Country: North America > United States (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.55)

arXiv.org Machine LearningMay-22-2016

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Tan, Conghui, Ma, Shiqian, Dai, Yu-Hong, Qian, Yuqiu

One of the major issues in stochastic gradient descent (SGD) methods is how to choose an appropriate step size while running the algorithm. Since the traditional line search technique does not apply for stochastic optimization algorithms, the common practice in SGD is either to use a diminishing step size, or to tune a fixed step size by hand, which can be time consuming in practice. In this paper, we propose to use the Barzilai-Borwein (BB) method to automatically compute step sizes for SGD and its variant: stochastic variance reduced gradient (SVRG) method, which leads to two algorithms: SGD-BB and SVRG-BB. We prove that SVRG-BB converges linearly for strongly convex objective functions. As a by-product, we prove the linear convergence result of SVRG with Option I proposed in [10], whose convergence result is missing in the literature. Numerical experiments on standard data sets show that the performance of SGD-BB and SVRG-BB is comparable to and sometimes even better than SGD and SVRG with best-tuned step sizes, and is superior to some advanced SGD variants.

artificial intelligence, machine learning, step size, (15 more...)

1605.04131

Country: Asia > China (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

arXiv.org Machine LearningMay-21-2016

Make Workers Work Harder: Decoupled Asynchronous Proximal Stochastic Gradient Descent

Li, Yitan, Xu, Linli, Zhong, Xiaowei, Ling, Qing

With the enormous growth of data size n and model complexity, asynchronous parallel algorithms [1, 2, 3, 4, 5, 6] have become an important tool and received significant successes for solving large scale machine learning problems in the form of (1). Asynchronous parallel algorithms distribute computation on multicore systems (shared memory architecture) or multi-machine system (parameter server architecture), whose computation power generally scales up with the increasing number of cores or machines. As a consequence, effective design and implementation of asynchronous parallel algorithms is critical for large scale machine learning. Numerous efforts have been devoted to this topic. Among them, asynchronous stochastic gradient descent is proposed in [1, 2], and its performance is guaranteed by theoretical convergence analyses. An asynchronous proximal gradient descent algorithm is designed on the parameter server architecture in [3] with a distributed optimization software provided. Convergence rate of asynchronous stochastic gradient descent with a nonconvex objective is analyzed in [4].

artificial intelligence, machine learning, proximal operator, (12 more...)

1605.06619

Country:

Europe (0.46)
North America > Canada > Quebec (0.14)

Genre: Research Report (0.82)

Industry: Education (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

arXiv.org Machine LearningMay-15-2016

Tracking Slowly Moving Clairvoyant: Optimal Dynamic Regret of Online Learning with True and Noisy Gradient

Yang, Tianbao, Zhang, Lijun, Jin, Rong, Yi, Jinfeng

This work focuses on dynamic regret of online convex optimization that compares the performance of online learning to a clairvoyant who knows the sequence of loss functions in advance and hence selects the minimizer of the loss function at each step. By assuming that the clairvoyant moves slowly (i.e., the minimizers change slowly), we present several improved variation-based upper bounds of the dynamic regret under the true and noisy gradient feedback, which are {\it optimal} in light of the presented lower bounds. The key to our analysis is to explore a regularity metric that measures the temporal changes in the clairvoyant's minimizers, to which we refer as {\it path variation}. Firstly, we present a general lower bound in terms of the path variation, and then show that under full information or gradient feedback we are able to achieve an optimal dynamic regret. Secondly, we present a lower bound with noisy gradient feedback and then show that we can achieve optimal dynamic regrets under a stochastic gradient feedback and two-point bandit feedback. Moreover, for a sequence of smooth loss functions that admit a small variation in the gradients, our dynamic regret under the two-point bandit feedback matches what is achieved with full information.

artificial intelligence, dynamic regret, machine learning, (14 more...)

1605.04638

Country: North America > United States > Iowa (0.28)

Genre: Research Report (0.64)

Industry: Education > Educational Setting > Online (0.62)

Technology:

Information Technology > Enterprise Applications > Human Resources > Learning Management (0.62)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

#artificialintelligenceMay-11-2016, 20:01:32 GMT

Pretty fast word2vec with Numba

At a high level, the skip-gram flavor of this algorithm looks at a word and its surrounding words, and tries to maximize the probability that the word's vector representation predicts those actual words occurring around it. If it is trained on the phrase the quick brown fox jumps, the word2vec input representation of the word brown would yield a high dot product with the output vectors for the words the, quick, fox and jumps. And if the algorithm is trained on a lot more text, there's a good chance that it will start to learn that the words brown and red appear in similar contexts, so that their vector representations will be pretty close to one another's. I'm constantly trying to navigate the trade-off from simultaneously maximizing my laziness and the speed of my code. Aiming for the lazy side, I originally wanted to experiment with using autograd to write my gradient updates for me, and as nice as it was to not have to write out the gradient calculation, this pretty quickly bubbled to the top as one of the major bottlenecks, leaving me to write the gradients manually by faster means.

artificial intelligence, machine learning, representation, (6 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.39)

Schmidt, Mark, Roux, Nicolas Le, Bach, Francis

Minimizing Finite Sums with the Stochastic Average Gradient

arXiv.org Machine LearningMay-11-2016

We propose the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method's iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values the SAG method achieves a faster convergence rate than black-box SG methods. The convergence rate is improved from O(1/k^{1/2}) to O(1/k) in general, and when the sum is strongly-convex the convergence rate is improved from the sub-linear O(1/k) to a linear convergence rate of the form O(p^k) for p \textless{} 1. Further, in many cases the convergence rate of the new method is also faster than black-box deterministic gradient methods, in terms of the number of gradient evaluations. Numerical experiments indicate that the new algorithm often dramatically outperforms existing SG and deterministic gradient methods, and that the performance may be further improved through the use of non-uniform sampling strategies.

artificial intelligence, data mining, machine learning, (17 more...)

1309.2388

Country: North America (0.27)

Genre: Research Report > New Finding (0.46)

Industry: Transportation > Air (0.54)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)
Information Technology > Data Science > Data Mining (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.35)

#artificialintelligenceMay-10-2016, 05:55:54 GMT

Training Neural Networks Without Gradients: A Scalable ADMM Approach

With the growing importance of large network models and enormous training datasets, GPUs have become increasingly necessary to train neural networks. This is largely because conventional optimization algorithms rely on stochastic gradient methods that don't scale well to large numbers of cores in a cluster setting. Furthermore, the convergence of all gradient methods, including batch methods, suffers from common problems like saturation effects, poor conditioning, and saddle points. This paper explores an unconventional training method that uses alternating direction methods and Bregman iteration to train networks without gradient descent steps. The proposed method reduces the network training problem to a sequence of minimization sub-steps that can each be solved globally in closed form. The proposed method is advantageous because it avoids many of the caveats that make gradient methods slow on highly non-convex problems.

artificial intelligence, machine learning, training neural network, (2 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.66)

#artificialintelligenceMay-7-2016, 06:55:31 GMT

Must Know Tips for Deep Learning Neural Networks, Part 2

Editor's note: Yesterday we posted the first 4 tips & ticks of this fascinating post. Be sure to have a look before reading the second half. One of the crucial factors in deep networks is activation function, which brings the non-linearity into networks. Here we will introduce the details and characters of some popular activation functions and give advices later in this section. The sigmoid non-linearity has the mathematical form?

artificial intelligence, machine learning, neuron, (18 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.31)

#artificialintelligenceMay-6-2016, 12:00:43 GMT

Single Neuron Gradient Descent

In my experience, the gap between a conceptual understanding of how a machine learning model "learns" and a concrete, "I can do this with a pencil and paper" understanding is large. This gap is further exacerbated by the nature of popular machine learning libraries which allow you to use powerful models without knowing how they really work. In this post, I aim to close the gap above for a vanilla neural network that learns by gradient descent: we will use gradient descent to learn a weight and a bias for a single neuron. From there, when learning an entire network of millions of neurons, we just do the same thing a bunch more times. The following assumes a cursory knowledge of linear combinations, activation functions, cost functions, and how they all fit together in forward propagation.

artificial intelligence, machine learning, weight and bias, (11 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.59)

Martens, James, Grosse, Roger

Optimizing Neural Networks with Kronecker-factored Approximate Curvature

arXiv.org Machine LearningMay-3-2016

We propose an efficient method for approximating natural gradient descent in neural networks which we call Kronecker-Factored Approximate Curvature (K-FAC). K-FAC is based on an efficiently invertible approximation of a neural network's Fisher information matrix which is neither diagonal nor low-rank, and in some cases is completely non-sparse. It is derived by approximating various large blocks of the Fisher (corresponding to entire layers) as being the Kronecker product of two much smaller matrices. While only several times more expensive to compute than the plain stochastic gradient, the updates produced by K-FAC make much more progress optimizing the objective, which results in an algorithm that can be much faster than stochastic gradient descent with momentum in practice. And unlike some previously proposed approximate natural-gradient/Newton methods which use high-quality non-diagonal curvature matrices (such as Hessian-free optimization), K-FAC works very well in highly stochastic optimization regimes. This is because the cost of storing and inverting K-FAC's approximation to the curvature matrix does not depend on the amount of data used to estimate it, which is a feature typically associated only with diagonal or low-rank approximations to the curvature matrix.

approximation, artificial intelligence, machine learning, (19 more...)

1503.05671

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)