Gradient Descent
Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences
Recht, Benjamin, Re, Christopher
Randomized algorithms that base iteration-level decisions on samples from some pool are ubiquitous in machine learning and optimization. Examples include stochastic gradient descent and randomized coordinate descent. This paper makes progress at theoretically evaluating the difference in performance between sampling with- and without-replacement in such algorithms. Focusing on least means squares optimization, we formulate a noncommutative arithmetic-geometric mean inequality that would prove that the expected convergence rate of without-replacement sampling is faster than that of with-replacement sampling. We demonstrate that this inequality holds for many classes of random matrices and for some pathological examples as well. We provide a deterministic worst-case bound on the gap between the discrepancy between the two sampling models, and explore some of the impediments to proving this inequality in full generality. We detail the consequences of this inequality for stochastic gradient descent and the randomized Kaczmarz algorithm for solving linear systems.
Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities
Yao, Angela, Gall, Juergen, Gool, Luc V., Urtasun, Raquel
A common approach for handling the complexity and inherent ambiguities of 3D human pose estimation is to use pose priors learned from training data. Existing approaches however, are either too simplistic (linear), too complex to learn, or can only learn latent spaces from "simple data", i.e., single activities such as walking or running. In this paper, we present an efficient stochastic gradient descent algorithm that is able to learn probabilistic non-linear latent spaces composed of multiple activities. Furthermore, we derive an incremental algorithm for the online setting which can update the latent space without extensive relearning. We demonstrate the effectiveness of our approach on the task of monocular and multi-view tracking and show that our approach outperforms the state-of-the-art.
Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning
Moulines, Eric, Bach, Francis R.
We consider the minimization of a convex objective function defined on a Hilbert space, which is only available through unbiased estimates of its gradients. This problem includes standard machine learning algorithms such as kernel logistic regression and least-squares regression, and is commonly referred to as a stochastic approximation problem in the operations research community. We provide a non-asymptotic analysis of the convergence of two well-known algorithms, stochastic gradient descent (a.k.a.~Robbins-Monro algorithm) as well as a simple modification where iterates are averaged (a.k.a.~Polyak-Ruppert averaging). Our analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate in the strongly convex case, is not robust to the lack of strong convexity or the setting of the proportionality constant. This situation is remedied when using slower decays together with averaging, robustly leading to the optimal rate of convergence. We illustrate our theoretical results with simulations on synthetic and standard datasets.
Learning a Distance Metric from a Network
Shaw, Blake, Huang, Bert, Jebara, Tony
Many real-world networks are described by both connectivity information and features for every node. To better model and understand these networks, we present structure preserving metric learning (SPML), an algorithm for learning a Mahalanobis distance metric from a network such that the learned distances are tied to the inherent connectivity structure of the network. Like the graph embedding algorithm structure preserving embedding, SPML learns a metric which is structure preserving, meaning a connectivity algorithm such as k-nearest neighbors will yield the correct connectivity when applied using the distances from the learned metric. We show a variety of synthetic and real-world experiments where SPML predicts link patterns from node features more accurately than standard techniques. We further demonstrate a method for optimizing SPML based on stochastic gradient descent which removes the running-time dependency on the size of the network and allows the method to easily scale to networks of thousands of nodes and millions of edges.
Distributed Delayed Stochastic Optimization
Agarwal, Alekh, Duchi, John C.
We analyze the convergence of gradient-based optimization algorithms whose updates depend on delayed stochastic gradient information. The main application of our results is to the development of distributed minimization algorithms where a master node performs parameter updates while worker nodes compute stochastic gradients based on local information in parallel, which may give rise to delays due to asynchrony. Our main contribution is to show that for smooth stochastic problems, the delays are asymptotically negligible. In application to distributed optimization, we show $n$-node architectures whose optimization error in stochastic problems---in spite of asynchronous delays---scales asymptotically as $\order(1 / \sqrt{nT})$, which is known to be optimal even in the absence of delays.
Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
Recht, Benjamin, Re, Christopher, Wright, Stephen, Niu, Feng
Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented *without any locking*. We present an update scheme called Hogwild which allows processors access to shared memory with the possibility of overwriting each other's work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then Hogwild achieves a nearly optimal rate of convergence. We demonstrate experimentally that Hogwild outperforms alternative schemes that use locking by an order of magnitude.
Stability Conditions for Online Learnability
Ross, Stephane, Bagnell, J. Andrew
Stability is a general notion that quantifies the sensitivity of a learning algorithm's output to small change in the training dataset (e.g. deletion or replacement of a single training sample). Such conditions have recently been shown to be more powerful to characterize learnability in the general learning setting under i.i.d. samples where uniform convergence is not necessary for learnability, but where stability is both sufficient and necessary for learnability. We here show that similar stability conditions are also sufficient for online learnability, i.e. whether there exists a learning algorithm such that under any sequence of examples (potentially chosen adversarially) produces a sequence of hypotheses that has no regret in the limit with respect to the best hypothesis in hindsight. We introduce online stability, a stability condition related to uniform-leave-one-out stability in the batch setting, that is sufficient for online learnability. In particular we show that popular classes of online learners, namely algorithms that fall in the category of Follow-the-(Regularized)-Leader, Mirror Descent, gradient-based methods and randomized algorithms like Weighted Majority and Hedge, are guaranteed to have no regret if they have such online stability property. We provide examples that suggest the existence of an algorithm with such stability condition might in fact be necessary for online learnability. For the more restricted binary classification setting, we establish that such stability condition is in fact both sufficient and necessary. We also show that for a large class of online learnable problems in the general learning setting, namely those with a notion of sub-exponential covering, no-regret online algorithms that have such stability condition exists.
Simulated Annealing Based Influence Maximization in Social Networks
Jiang, Qingye (Peking University) | Song, Guojie (Peking University) | Gao, Cong (Nanyang Technological University) | Wang, Yu (Peking University) | Si, Wenjun (Peking University) | Xie, Kunqing (Peking University)
The problem of influence maximization, i.e., mining top-k influential nodes from a social network such that the spread of influence in the network is maximized, is NP-hard. Most of the existing algorithms for the prob- lem are based on greedy algorithm. Although greedy algorithm can achieve a good approximation, it is computational expensive. In this paper, we propose a totally different approach based on Simulated Annealing(SA) for the influence maximization problem. This is the first SA based algorithm for the problem. Additionally, we propose two heuristic methods to accelerate the con- vergence process of SA, and a new method of comput- ing influence to speed up the proposed algorithm. Experimental results on four real networks show that the proposed algorithms run faster than the state-of-the-art greedy algorithm by 2-3 orders of magnitude while being able to improve the accuracy of greedy algorithm.
Wsabie: Scaling Up to Large Vocabulary Image Annotation
Weston, Jason (Google Research) | Bengio, Samy (Google Research) | Usunier, Nicolas (Universite de Paris 6)
Weighted Pairwise Classification (OWPC) loss [Usunier et al., 2009] which has been shown to be state-of-the-art on Image annotation datasets are becoming larger and (small) text retrieval tasks. WARP uses stochastic gradient larger, with tens of millions of images and tens descent and a novel sampling trick to approximate ranks resulting of thousands of possible annotations. We propose in an efficient online optimization strategy which we a strongly performing method that scales to show is superior to standard stochastic gradient descent applied such datasets by simultaneously learning to optimize to the same loss, enabling us to train on datasets that precision at the top of the ranked list of annotations do not even fit in memory.
Online Identification and Tracking of Subspaces from Highly Incomplete Information
Balzano, Laura, Nowak, Robert, Recht, Benjamin
This work presents GROUSE (Grassmanian Rank-One Update Subspace Estimation), an efficient online algorithm for tracking subspaces from highly incomplete observations. GROUSE requires only basic linear algebraic manipulations at each iteration, and each subspace update can be performed in linear time in the dimension of the subspace. The algorithm is derived by analyzing incremental gradient descent on the Grassmannian manifold of subspaces. With a slight modification, GROUSE can also be used as an online incremental algorithm for the matrix completion problem of imputing missing entries of a low-rank matrix. GROUSE performs exceptionally well in practice both in tracking subspaces and as an online algorithm for matrix completion.