Goto

Collaborating Authors

 Optimization


Analysis of Nuclear Norm Regularization for Full-rank Matrix Completion

arXiv.org Machine Learning

In this paper, we provide a theoretical analysis of the nuclear-norm regularized least squares for full-rank matrix completion. Although similar formulations have been examined by previous studies, their results are unsatisfactory because only additive upper bounds are provided. Under the assumption that the top eigenspaces of the target matrix are incoherent, we derive a relative upper bound for recovering the best low-rank approximation of the unknown matrix. Our relative upper bound is tighter than previous additive bounds of other methods if the mass of the target matrix is concentrated on its top eigenspaces, and also implies perfect recovery if it is low-rank. The analysis is built upon the optimality condition of the regularized formulation and existing guarantees for low-rank matrix completion. To the best of our knowledge, this is first time such a relative bound is proved for the regularized formulation of matrix completion.


The Power of Randomization: Distributed Submodular Maximization on Massive Datasets

arXiv.org Artificial Intelligence

A wide variety of problems in machine learning, including exemplar clustering, document summarization, and sensor placement, can be cast as constrained submodular maximization problems. Unfortunately, the resulting submodular optimization problems are often too large to be solved on a single machine. We develop a simple distributed algorithm that is embarrassingly parallel and it achieves provable, constant factor, worst-case approximation guarantees. In our experiments, we demonstrate its efficiency in large problems with different kinds of constraints with objective values always close to what is achievable in the centralized setting.


Fast optimization of Multithreshold Entropy Linear Classifier

arXiv.org Machine Learning

Many methods of speeding up the kernel density estimator's (KDE) querying process has been proposed in the literature [12, 14, 6]. As op-1 timization problem introduced in Multithreshold Entropy Linear Classifier [5] is closely related to the equations of KDE it appears natural that similar techniques can be used to simplify its computations with a bounded error. Importance of such reductions comes from the high (quadratic) complexity of the evaluation of functions required during training of this model which makes it hard to use for any dataset with more than a thousand points. In this paper we investigate two such approaches, first - sorting and discarding, which ignores computations of similarities between points that are too far away to have big impact on the function's value, second - binning, which smooths the function construction in order to heavily reduce amount of unique points. Both these methods are introduced in an adaptive manner so the optimization process have fixed error bound despite many different linear projections being analyzed during the training phase. We also show a very simple method which enables to use a wide range of optimization algorithms even though proposed model requires optimization with a specific constraints (sphere bounded).


Relax, no need to round: integrality of clustering formulations

arXiv.org Machine Learning

We study exact recovery conditions for convex relaxations of point cloud clustering problems, focusing on two of the most common optimization problems for unsupervised clustering: $k$-means and $k$-median clustering. Motivations for focusing on convex relaxations are: (a) they come with a certificate of optimality, and (b) they are generic tools which are relatively parameter-free, not tailored to specific assumptions over the input. More precisely, we consider the distributional setting where there are $k$ clusters in $\mathbb{R}^m$ and data from each cluster consists of $n$ points sampled from a symmetric distribution within a ball of unit radius. We ask: what is the minimal separation distance between cluster centers needed for convex relaxations to exactly recover these $k$ clusters as the optimal integral solution? For the $k$-median linear programming relaxation we show a tight bound: exact recovery is obtained given arbitrarily small pairwise separation $\epsilon > 0$ between the balls. In other words, the pairwise center separation is $\Delta > 2+\epsilon$. Under the same distributional model, the $k$-means LP relaxation fails to recover such clusters at separation as large as $\Delta = 4$. Yet, if we enforce PSD constraints on the $k$-means LP, we get exact cluster recovery at center separation $\Delta > 2\sqrt2(1+\sqrt{1/m})$. In contrast, common heuristics such as Lloyd's algorithm (a.k.a. the $k$-means algorithm) can fail to recover clusters in this setting; even with arbitrarily large cluster separation, k-means++ with overseeding by any constant factor fails with high probability at exact cluster recovery. To complement the theoretical analysis, we provide an experimental study of the recovery guarantees for these various methods, and discuss several open problems which these experiments suggest.


Quick sensitivity analysis for incremental data modification and its application to leave-one-out CV in linear classification problems

arXiv.org Machine Learning

We introduce a novel sensitivity analysis framework for large scale classification problems that can be used when a small number of instances are incrementally added or removed. For quickly updating the classifier in such a situation, incremental learning algorithms have been intensively studied in the literature. Although they are much more efficient than solving the optimization problem from scratch, their computational complexity yet depends on the entire training set size. It means that, if the original training set is large, completely solving an incremental learning problem might be still rather expensive. To circumvent this computational issue, we propose a novel framework that allows us to make an inference about the updated classifier without actually re-optimizing it. Specifically, the proposed framework can quickly provide a lower and an upper bounds of a quantity on the unknown updated classifier. The main advantage of the proposed framework is that the computational cost of computing these bounds depends only on the number of updated instances. This property is quite advantageous in a typical sensitivity analysis task where only a small number of instances are updated. In this paper we demonstrate that the proposed framework is applicable to various practical sensitivity analysis tasks, and the bounds provided by the framework are often sufficiently tight for making desired inferences.


Efficient SDP Inference for Fully-connected CRFs Based on Low-rank Decomposition

arXiv.org Machine Learning

Conditional Random Fields (CRF) have been widely used in a variety of computer vision tasks. Conventional CRFs typically define edges on neighboring image pixels, resulting in a sparse graph such that efficient inference can be performed. However, these CRFs fail to model long-range contextual relationships. Fully-connected CRFs have thus been proposed. While there are efficient approximate inference methods for such CRFs, usually they are sensitive to initialization and make strong assumptions. In this work, we develop an efficient, yet general algorithm for inference on fully-connected CRFs. The algorithm is based on a scalable SDP algorithm and the low- rank approximation of the similarity/kernel matrix. The core of the proposed algorithm is a tailored quasi-Newton method that takes advantage of the low-rank matrix approximation when solving the specialized SDP dual problem. Experiments demonstrate that our method can be applied on fully-connected CRFs that cannot be solved previously, such as pixel-level image co-segmentation.


Proximal operators for multi-agent path planning

arXiv.org Artificial Intelligence

We address the problem of planning collision-free paths for multiple agents using optimization methods known as proximal algorithms. Recently this approach was explored in Bento et al. 2013, which demonstrated its ease of parallelization and decentralization, the speed with which the algorithms generate good quality solutions, and its ability to incorporate different proximal operators, each ensuring that paths satisfy a desired property. Unfortunately, the operators derived only apply to paths in 2D and require that any intermediate waypoints we might want agents to follow be preassigned to specific agents, limiting their range of applicability. In this paper we resolve these limitations. We introduce new operators to deal with agents moving in arbitrary dimensions that are faster to compute than their 2D predecessors and we introduce landmarks, space-time positions that are automatically assigned to the set of agents under different optimality criteria. Finally, we report the performance of the new operators in several numerical experiments.


Hyperparameter Search in Machine Learning

arXiv.org Machine Learning

Machine learning research focuses on the development of methods that are capable of capturing some element of interest from a given data set. Such elements include but are not limited to coherent structures within data (clustering) or the ability to predict certain target values based on given characteristics, which may be discrete (classification) or continuous (regression). A large variety of learning methods exist, ranging from biologically inspired neural networks [7] over kernel methods [29] to ensemble models [9, 11]. A common trait in these methods is that they are parameterized by a set of hyperparameters ฮป, which must be set appropriately by the user to maximize the usefulness of the learning approach. Hyperparameters are used to configure various aspects of the learning algorithm and can have wildly varying effects on the resulting model and its performance. Hyperparameter search is commonly performed manually, via rules-of-thumb [19, 20] or by testing sets of hyperparameters on a predefined grid [28]. These approaches leave much to be desired in terms of reproducibility and are impractical when the number of hyperparameters is large [10]. Due to these flaws, the idea of automating hyperparameter search is receiving increasing amounts of attention in machine learning, for instance via benchmarking suites [15] and various initiatives.


Early Stopping is Nonparametric Variational Inference

arXiv.org Machine Learning

We show that unconverged stochastic gradient descent can be interpreted as a procedure that samples from a nonparametric variational approximate posterior distribution. This distribution is implicitly defined as the transformation of an initial distribution by a sequence of optimization updates. By tracking the change in entropy over this sequence of transformations during optimization, we form a scalable, unbiased estimate of the variational lower bound on the log marginal likelihood. We can use this bound to optimize hyperparameters instead of using cross-validation. This Bayesian interpretation of SGD suggests improved, overfitting-resistant optimization procedures, and gives a theoretical foundation for popular tricks such as early stopping and ensembling. We investigate the properties of this marginal likelihood estimator on neural network models.


Inferring Sentiment from Web Images with Joint Inference on Visual and Social Cues: A Regulated Matrix Factorization Approach

AAAI Conferences

In this paper, we study the problem of understanding human sentiments from large scale collection of Internet images based on both image features and contextual social network information (such as friend comments and user description). Despite the great strides in analyzing user sentiment based on text information, the analysis of sentiment behind the image content has largely been ignored. Thus, we extend the significant advances in text-based sentiment prediction tasks to the higher-level challenge of predicting the underlying sentiments behind the images. We show that neither visual features nor the textual features are by themselves sufficient for accurate sentiment labeling. Thus, we provide a way of using both of them. We leverage the low-level visual features and mid-level attributes of an image, and formulate sentiment prediction problem as a non-negative matrix tri-factorization framework, which has the flexibility to incorporate multiple modalities of information and the capability to learn from heterogeneous features jointly. We develop an optimization algorithm for finding a local-optima solution under the proposed framework. With experiments on two large-scale datasets, we show that the proposed method improves significantly over existing state-of-the-art methods.