Goto

Collaborating Authors

 Mathematical & Statistical Methods


Practical Inexact Proximal Quasi-Newton Method with Global Complexity Analysis

arXiv.org Machine Learning

Recently several methods were proposed for sparse optimization which make careful use of second-order information [10, 28, 16, 3] to improve local convergence rates. These methods construct a composite quadratic approximation using Hessian information, optimize this approximation using a first-order method, such as coordinate descent and employ a line search to ensure sufficient descent. Here we propose a general framework, which includes slightly modified versions of existing algorithms and also a new algorithm, which uses limited memory BFGS Hessian approximations, and provide a novel global convergence rate analysis, which covers methods that solve subproblems via coordinate descent.


Linear Convergence of Variance-Reduced Stochastic Gradient without Strong Convexity

arXiv.org Machine Learning

Stochastic gradient algorithms estimate the gradient based on only one or a few samples and enjoy low computational cost per iteration. They have been widely used in large-scale optimization problems. However, stochastic gradient algorithms are usually slow to converge and achieve sub-linear convergence rates, due to the inherent variance in the gradient computation. To accelerate the convergence, some variance-reduced stochastic gradient algorithms, e.g., proximal stochastic variance-reduced gradient (Prox-SVRG) algorithm, have recently been proposed to solve strongly convex problems. Under the strongly convex condition, these variance-reduced stochastic gradient algorithms achieve a linear convergence rate. However, many machine learning problems are convex but not strongly convex. In this paper, we introduce Prox-SVRG and its projected variant called Variance-Reduced Projected Stochastic Gradient (VRPSG) to solve a class of non-strongly convex optimization problems widely used in machine learning. As the main technical contribution of this paper, we show that both VRPSG and Prox-SVRG achieve a linear convergence rate without strong convexity. A key ingredient in our proof is a Semi-Strongly Convex (SSC) inequality which is the first to be rigorously proved for a class of non-strongly convex problems in both constrained and regularized settings. Moreover, the SSC inequality is independent of algorithms and may be applied to analyze other stochastic gradient algorithms besides VRPSG and Prox-SVRG, which may be of independent interest. To the best of our knowledge, this is the first work that establishes the linear convergence rate for the variance-reduced stochastic gradient algorithms on solving both constrained and regularized problems without strong convexity.


Online Matrix Factorization via Broyden Updates

arXiv.org Machine Learning

In this paper, we propose an online algorithm to compute matrix factorizations. Proposed algorithm updates the dictionary matrix and associated coefficients using a single observation at each time. The algorithm performs low-rank updates to dictionary matrix. We derive the algorithm by defining a simple objective function to minimize whenever an observation is arrived. We extend the algorithm further for handling missing data. We also provide a mini-batch extension which enables to compute the matrix factorization on big datasets. We demonstrate the efficiency of our algorithm on a real dataset and give comparisons with well-known algorithms such as stochastic gradient matrix factorization and nonnegative matrix factorization (NMF).


Semi-Stochastic Gradient Descent Methods

arXiv.org Machine Learning

In this paper we study the problem of minimizing the average of a large number ($n$) of smooth convex loss functions. We propose a new method, S2GD (Semi-Stochastic Gradient Descent), which runs for one or several epochs in each of which a single full gradient and a random number of stochastic gradients is computed, following a geometric law. The total work needed for the method to output an $\varepsilon$-accurate solution in expectation, measured in the number of passes over data, or equivalently, in units equivalent to the computation of a single gradient of the loss, is $O((\kappa/n)\log(1/\varepsilon))$, where $\kappa$ is the condition number. This is achieved by running the method for $O(\log(1/\varepsilon))$ epochs, with a single gradient evaluation and $O(\kappa)$ stochastic gradient evaluations in each. The SVRG method of Johnson and Zhang arises as a special case. If our method is limited to a single epoch only, it needs to evaluate at most $O((\kappa/\varepsilon)\log(1/\varepsilon))$ stochastic gradients. In contrast, SVRG requires $O(\kappa/\varepsilon^2)$ stochastic gradients. To illustrate our theoretical results, S2GD only needs the workload equivalent to about 2.1 full gradient evaluations to find an $10^{-6}$-accurate solution for a problem with $n=10^9$ and $\kappa=10^3$.


Robust Vertex Classification

arXiv.org Machine Learning

For random graphs distributed according to stochastic blockmodels, a special case of latent position graphs, adjacency spectral embedding followed by appropriate vertex classification is asymptotically Bayes optimal; but this approach requires knowledge of and critically depends on the model dimension. In this paper, we propose a sparse representation vertex classifier which does not require information about the model dimension. This classifier represents a test vertex as a sparse combination of the vertices in the training set and uses the recovered coefficients to classify the test vertex. We prove consistency of our proposed classifier for stochastic blockmodels, and demonstrate that the sparse representation classifier can predict vertex labels with higher accuracy than adjacency spectral embedding approaches via both simulation studies and real data experiments. Our results demonstrate the robustness and effectiveness of our proposed vertex classifier when the model dimension is unknown.


Non-Uniform Stochastic Average Gradient Method for Training Conditional Random Fields

arXiv.org Machine Learning

Conditional random fields (CRFs) [Lafferty et al., 2001] are a ubiquitous tool in natural language processing. They are used for part-of-speech tagging [McCallum et al., 2003], semantic role labeling [Cohn and Blunsom, 2005], topic modeling [Zhu and Xing, 2010], information extraction [Peng and McCallum, 2006], shallow parsing [Sha and Pereira, 2003], named-entity recognition [Settles, 2004], as well as a host of other applications in natural language processing and in other fields such as computer vision [Nowozin and Lampert, 2011]. Similar to generative Markov random field (MRF) models, CRFs allow us to model probabilistic dependencies between output variables. The key advantage of discriminative CRF models is the ability to use a very highdimensional feature set, without explicitly building a model for these features (as required by MRF models). Despite the widespread use of CRFs, a major disadvantage of these models is that they can be very slow to train and the time needed for numerical optimization in CRF models remains a bottleneck in many applications. Due to the high cost of evaluating the CRF objective function on even a single training example, it is now common to train CRFs using stochastic gradient methods [Vishwanathan et al., 2006]. These methods are advantageous over deterministic methods because on each iteration they only require computing the gradient of a single example (and not all example as in deterministic methods). Thus, if we have a data set with n training examples, the iterations of stochastic gradient methods are n times faster than deterministic methods. However, the number of stochastic gradient iterations required might be very high.


Computational Lower Bounds for Community Detection on Random Graphs

arXiv.org Machine Learning

This paper studies the problem of detecting the presence of a small dense community planted in a large Erd\H{o}s-R\'enyi random graph $\mathcal{G}(N,q)$, where the edge probability within the community exceeds $q$ by a constant factor. Assuming the hardness of the planted clique detection problem, we show that the computational complexity of detecting the community exhibits the following phase transition phenomenon: As the graph size $N$ grows and the graph becomes sparser according to $q=N^{-\alpha}$, there exists a critical value of $\alpha = \frac{2}{3}$, below which there exists a computationally intensive procedure that can detect far smaller communities than any computationally efficient procedure, and above which a linear-time procedure is statistically optimal. The results also lead to the average-case hardness results for recovering the dense community and approximating the densest $K$-subgraph.


A Generalized Reduced Linear Program for Markov Decision Processes

AAAI Conferences

Markov decision processes (MDPs) with large number of states are of high practical interest. However, conventional algorithms to solve MDP are computationally infeasible in this scenario. Approximate dynamic programming (ADP) methods tackle this issue by computing approximate solutions. A widely applied ADP method is approximate linear program (ALP) which makes use of linear function approximation and offers theoretical performance guarantees. Nevertheless, the ALP is difficult to solve due to the presence of a large number of constraints and in practice, a reduced linear program (RLP) is solved instead. The RLP has a tractable number of constraints sampled from the original constraints of the ALP. Though the RLP is known to perform well in experiments, theoretical guarantees are available only for a specific RLP obtained under idealized assumptions. In this paper, we generalize the RLP to define a generalized reduced linear program (GRLP) which has a tractable number of constraints that are obtained as positive linear combinations of the original constraints of the ALP. The main contribution of this paper is the novel theoretical framework developed to obtain error bounds for any given GRLP. Central to our framework are two max-norm contraction operators. Our result theoretically justifies linear approximation of constraints. We discuss the implication of our results in the contexts of ADP and reinforcement learning. We also demonstrate via an example in the domain of controlled queues that the experiments conform to the theory.


Sequence-Form Algorithm for Computing Stackelberg Equilibria in Extensive-Form Games

AAAI Conferences

Stackelberg equilibrium is a solution concept prescribing for a player an optimal strategy to commit to, assuming the opponent knows this commitment and plays the best response. Although this solution concept is a cornerstone of many security applications, the existing works typically do not consider situations where the players can observe and react to the actions of the opponent during the course of the game. We extend the existing algorithmic work to extensive-form games and introduce novel algorithm for computing Stackelberg equilibria that exploits the compact sequence-form representation of strategies. Our algorithm reduces the size of the linear programs from exponential in the baseline approach to linear in the size of the game tree. Experimental evaluation on randomly generated games and a security-inspired search game demonstrates significant improvement in the scalability compared to the baseline approach.


Mixed-Integer Linear Programming for Planning with Temporal Logic Tasks [Position Paper]

AAAI Conferences

We are concerned with controlling dynamical systems, such as self-driving cars and smart buildings, in a manner that guarantees that they satisfy complex task specifications. Mixed integer linear programming has recently proven to be a powerful tool for such problems, enabling the computation of optimal plans that satisfy complex temporal constraints for high-dimensional, dynamical systems. These optimization-based approaches find solutions quickly for challenging (and previously unsolvable) planning problems. Framing temporal logic planning as constrained optimization also presents exciting new areas of research.