Goto

Collaborating Authors

 Optimization


Accelerated Stochastic Subgradient Methods under Local Error Bound Condition

arXiv.org Machine Learning

In this paper, we propose two {\bf accelerated stochastic subgradient} methods for stochastic non-strongly convex optimization problems by leveraging a generic local error bound condition. The novelty of the proposed methods lies at smartly leveraging the recent historical solution to tackle the variance in the stochastic subgradient. The key idea of both methods is to iteratively solve the original problem approximately in a local region around a recent historical solution with size of the local region gradually decreasing as the solution approaches the optimal set. The difference of the two methods lies at how to construct the local region. The first method uses an explicit ball constraint and the second method uses an implicit regularization approach. For both methods, we establish the improved iteration complexity in a high probability for achieving an $\epsilon$-optimal solution. Besides the improved order of iteration complexity with a high probability, the proposed algorithms also enjoy a logarithmic dependence on the distance of the initial solution to the optimal set. We also consider applications in machine learning and demonstrate that the proposed algorithms enjoy faster convergence than the traditional stochastic subgradient method. For example, when applied to the $\ell_1$ regularized polyhedral loss minimization (e.g., hinge loss, absolute loss), the proposed stochastic methods have a logarithmic iteration complexity.


Kernel Approximation Methods for Speech Recognition

arXiv.org Machine Learning

We study large-scale kernel methods for acoustic modeling in speech recognition and compare their performance to deep neural networks (DNNs). We perform experiments on four speech recognition datasets, including the TIMIT and Broadcast News benchmark tasks, and compare these two types of models on frame-level performance metrics (accuracy, cross-entropy), as well as on recognition metrics (word/character error rate). In order to scale kernel methods to these large datasets, we use the random Fourier feature method of Rahimi and Recht (2007). We propose two novel techniques for improving the performance of kernel acoustic models. First, in order to reduce the number of random features required by kernel models, we propose a simple but effective method for feature selection. The method is able to explore a large number of non-linear features while maintaining a compact model more efficiently than existing approaches. Second, we present a number of frame-level metrics which correlate very strongly with recognition performance when computed on the heldout set; we take advantage of these correlations by monitoring these metrics during training in order to decide when to stop learning. This technique can noticeably improve the recognition performance of both DNN and kernel models, while narrowing the gap between them. Additionally, we show that the linear bottleneck method of Sainath et al. (2013) improves the performance of our kernel models significantly, in addition to speeding up training and making the models more compact. Together, these three methods dramatically improve the performance of kernel acoustic models, making their performance comparable to DNNs on the tasks we explored.


Neural Combinatorial Optimization with Reinforcement Learning

arXiv.org Machine Learning

This paper presents a framework to tackle combinatorial optimization problems using neural networks and reinforcement learning. We focus on the traveling salesman problem (TSP) and train a recurrent network that, given a set of city coordinates, predicts a distribution over different city permutations. Using negative tour length as the reward signal, we optimize the parameters of the recurrent network using a policy gradient method. We compare learning the network parameters on a set of training graphs against learning them on individual test graphs. Despite the computational expense, without much engineering and heuristic designing, Neural Combinatorial Optimization achieves close to optimal results on 2D Euclidean graphs with up to 100 nodes. Applied to the KnapSack, another NP-hard problem, the same method obtains optimal solutions for instances with up to 200 items.


A Primer on Coordinate Descent Algorithms

arXiv.org Machine Learning

This particular class of algorithms has recently gained popularity due to their effectiveness in solving large-scale optimization problems in machine learning, compressed sensing, image processing, and computational statistics. Coordinate descent algorithms solve optimization problems by successively minimizing along each coordinate or coordinate hyperplane, which is ideal for parallelized and distributed computing. Avoiding detailed technicalities and proofs, this monograph gives relevant theory and examples for practitioners to effectively apply coordinate descent to modern problems in data science and engineering. To keep the primer up-to-date, we intend to publish this monograph only after no additional topics need to be added and we foresee no further major advances in the area. 1 Introduction


On-demand high-capacity ride-sharing via dynamic trip-vehicle assignment

#artificialintelligence

Edited by Michael F. Goodchild, University of California, Santa Barbara, CA, and approved November 22, 2016 (received for review July 20, 2016) Ride-sharing services can provide not only a very personalized mobility experience but also ensure efficiency and sustainability via large-scale ride pooling. Large-scale ride-sharing requires mathematical models and algorithms that can match large groups of riders to a fleet of shared vehicles in real time, a task not fully addressed by current solutions. We present a highly scalable anytime optimal algorithm and experimentally validate its performance using New York City taxi data and a shared vehicle fleet with passenger capacities of up to ten. Our results show that 2,000 vehicles (15% of the taxi fleet) of capacity 10 or 3,000 of capacity 4 can serve 98% of the demand within a mean waiting time of 2.8 min and mean trip delay of 3.5 min. Ride-sharing services are transforming urban mobility by providing timely and convenient transportation to anybody, anywhere, and anytime. Current mathematical models, however, do not fully address the potential of ride-sharing. Recently, a large-scale study highlighted some of the benefits of car pooling but was limited to static routes with two riders per vehicle (optimally) or three (with heuristics).


London Machine Learning Meetup

#artificialintelligence

It is well known that the global optimum of a MDP with finite state and action sets can be obtained through methods based on dynamic programming. Unfortunately, these techniques are known to suffer from the curse of dimensionality, which makes them infeasible for many real-world problems of interest. As a result, most research in the reinforcement learning and control theory literature has focused on obtaining approximate or locally optimal solutions. There exists a broad spectrum of such techniques, including approximate dynamic programming methods, tree search methods, local trajectory-optimization techniques, such as differential dynamic programming and iLQG, and policy search methods. In this talk I shall provide an introduction to policy search methods, which are a family of algorithms that have proven extremely popular in recent years, and which have numerous desirable properties that make them attractive in practice.


Fast Discrete Distribution Clustering Using Wasserstein Barycenter with Sparse Support

arXiv.org Machine Learning

In a variety of research areas, the weighted bag of vectors and the histogram are widely used descriptors for complex objects. Both can be expressed as discrete distributions. D2-clustering pursues the minimum total within-cluster variation for a set of discrete distributions subject to the Kantorovich-Wasserstein metric. D2-clustering has a severe scalability issue, the bottleneck being the computation of a centroid distribution, called Wasserstein barycenter, that minimizes its sum of squared distances to the cluster members. In this paper, we develop a modified Bregman ADMM approach for computing the approximate discrete Wasserstein barycenter of large clusters. In the case when the support points of the barycenters are unknown and have low cardinality, our method achieves high accuracy empirically at a much reduced computational cost. The strengths and weaknesses of our method and its alternatives are examined through experiments, and we recommend scenarios for their respective usage. Moreover, we develop both serial and parallelized versions of the algorithm. By experimenting with large-scale data, we demonstrate the computational efficiency of the new methods and investigate their convergence properties and numerical stability. The clustering results obtained on several datasets in different domains are highly competitive in comparison with some widely used methods in the corresponding areas.


Neuro-dynamic Programming: Building human curiosity into artificial intelligence - Analytics Magazine

#artificialintelligence

The media and watercooler chatter alike increasingly focus on how advances in machine learning and artificial intelligence (AI) are boosting the ability of predictive analytics to benefit businesses' bottom lines. Some of that talk ponders the potential for smart machines to replace humans in higher-complexity jobs. No doubt, smart machines are getting smarter. But even the smartest machines still lack fundamental human characteristics that are absolutely critical to enabling people to solve problems. One of these key capabilities is curiosity – surely a computer can't replicate that, can it?


Optimal Low-Rank Dynamic Mode Decomposition

arXiv.org Machine Learning

Dynamic Mode Decomposition (DMD) has emerged as a powerful tool for analyzing the dynamics of non-linear systems from experimental datasets. Recently, several attempts have extended DMD to the context of low-rank approximations. This extension is of particular interest for reduced-order modeling in various applicative domains, e.g. for climate prediction, to study molecular dynamics or micro-electromechanical devices. This low-rank extension takes the form of a non-convex optimization problem. To the best of our knowledge, only sub-optimal algorithms have been proposed in the literature to compute the solution of this problem. In this paper, we prove that there exists a closed-form optimal solution to this problem and design an effective algorithm to compute it based on Singular Value Decomposition (SVD). A toy-example illustrates the gain in performance of the proposed algorithm compared to state-of-the-art techniques.


A Matrix Factorization Approach for Learning Semidefinite-Representable Regularizers

arXiv.org Machine Learning

Regularization techniques are widely employed in optimization-based approaches for solving ill-posed inverse problems in data analysis and scientific computing. These methods are based on augmenting the objective with a penalty function, which is specified based on prior domain-specific expertise to induce a desired structure in the solution. We consider the problem of learning suitable regularization functions from data in settings in which precise domain knowledge is not directly available. Previous work under the title of `dictionary learning' or `sparse coding' may be viewed as learning a regularization function that can be computed via linear programming. We describe generalizations of these methods to learn regularizers that can be computed and optimized via semidefinite programming. Our framework for learning such semidefinite regularizers is based on obtaining structured factorizations of data matrices, and our algorithmic approach for computing these factorizations combines recent techniques for rank minimization problems along with an operator analog of Sinkhorn scaling. Under suitable conditions on the input data, our algorithm provides a locally linearly convergent method for identifying the correct regularizer that promotes the type of structure contained in the data. Our analysis is based on the stability properties of Operator Sinkhorn scaling and their relation to geometric aspects of determinantal varieties (in particular tangent spaces with respect to these varieties). The regularizers obtained using our framework can be employed effectively in semidefinite programming relaxations for solving inverse problems.