 Statistical Learning


Coordinate Descent Converges Faster with the Gauss-Southwell Rule Than Random Selection

arXiv.org Machine Learning

There has been significant recent work on the theory and application of randomized coordinate descent algorithms, beginning with the work of Nesterov [SIAM J. Optim., 22(2), 2012], who showed that a random-coordinate selection rule achieves the same convergence rate as the Gauss-Southwell selection rule. This result suggests that we should never use the Gauss-Southwell rule, as it is typically much more expensive than random selection. However, the empirical behaviours of these algorithms contradict this theoretical result: in applications where the computational costs of the selection rules are comparable, the Gauss-Southwell selection rule tends to perform substantially better than random coordinate selection. We give a simple analysis of the Gauss-Southwell rule showing that, except in extreme cases, its convergence rate is faster than that of choosing random coordinates. Further, in this work we (i) show that exact coordinate optimization improves the convergence rate for certain sparse problems, (ii) propose a Gauss-Southwell-Lipschitz rule that gives an even faster convergence rate given knowledge of the Lipschitz constants of the partial derivatives, (iii) analyze the effect of approximate Gauss-Southwell rules, and (iv) analyze proximal-gradient variants of the Gauss-Southwell rule.
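
To make the two selection rules in the abstract above concrete, here is a minimal sketch that runs exact coordinate descent on a small strongly convex quadratic, once picking the coordinate with the largest absolute gradient entry (Gauss-Southwell) and once picking a coordinate uniformly at random. The test problem, the function names, and the iteration count are illustrative assumptions and are not taken from the paper.

```python
import random

def gradient(A, b, x):
    """Gradient of f(x) = 0.5 x^T A x - b^T x, which is A x - b."""
    return [sum(a * xj for a, xj in zip(row, x)) - bi for row, bi in zip(A, b)]

def objective(A, b, x):
    """Value of f(x) = 0.5 x^T A x - b^T x."""
    Ax = [sum(a * xj for a, xj in zip(row, x)) for row in A]
    return 0.5 * sum(xi * v for xi, v in zip(x, Ax)) - sum(bi * xi for bi, xi in zip(b, x))

def coordinate_descent(A, b, rule, iters, seed=0):
    """Exact coordinate minimization of the quadratic, starting from x = 0."""
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        # Recomputing the full gradient keeps the sketch short; it is also
        # why a naive Gauss-Southwell step costs more than a random one.
        g = gradient(A, b, x)
        if rule == "gauss-southwell":
            i = max(range(n), key=lambda j: abs(g[j]))  # greedy choice
        else:
            i = rng.randrange(n)                        # uniform random choice
        x[i] -= g[i] / A[i][i]  # exact minimizer along coordinate i
    return x

# Hypothetical test problem: a symmetric positive definite 5x5 matrix.
n = 5
A = [[2.5 if i == j else 0.5 for j in range(n)] for i in range(n)]
b = [1.0, 2.0, 3.0, 4.0, 5.0]
x_gs = coordinate_descent(A, b, "gauss-southwell", iters=500)
x_rand = coordinate_descent(A, b, "random", iters=500)
```

Comparing `objective(A, b, x_gs)` with `objective(A, b, x_rand)` after a small, equal number of iterations is one way to observe the difference between the rules on a given problem; a single run on one problem does not by itself confirm the paper's rates.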


Bootstrap Bias Corrections for Ensemble Methods

arXiv.org Machine Learning

This paper examines the use of a residual bootstrap for bias correction in machine learning regression methods. Accounting for bias is an important obstacle in recent efforts to develop statistical inference for machine learning methods. We demonstrate empirically that the proposed bootstrap bias correction can lead to substantial improvements in both bias and predictive accuracy. In the context of ensembles of trees, we show that this correction can be approximated at only double the cost of training the original ensemble without introducing additional variance. Our method is shown to improve test-set accuracy over random forests by up to 70% on example problems from the UCI repository.


Mutual Dependence: A Novel Method for Computing Dependencies Between Random Variables

arXiv.org Machine Learning

In data science, it is often required to estimate dependencies between different data sources. These dependencies are typically calculated using Pearson's correlation, distance correlation, and/or mutual information. However, none of these measures satisfies all of Granger's axioms for an "ideal measure". One such ideal measure, proposed by Granger himself, calculates the Bhattacharyya distance between the joint probability density function (pdf) and the product of marginal pdfs. We call this measure the mutual dependence. However, to date this measure has not been directly computable from data. In this paper, we use our recently introduced maximum likelihood non-parametric estimator for band-limited pdfs to compute the mutual dependence directly from the data. We construct the estimator of mutual dependence and compare its performance to standard measures (Pearson's and distance correlation) for different known pdfs by computing convergence rates, computational complexity, and the ability to capture nonlinear dependencies. Our mutual dependence estimator requires fewer samples to converge to theoretical values, is faster to compute, and captures more complex dependencies than standard measures.
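
As a rough illustration of the quantity described above, the sketch below computes the standard Bhattacharyya distance, -ln(sum of sqrt(p*q)), between a discrete joint probability table and the product of its marginals. This is a discrete analogue under stated assumptions: it is not the paper's band-limited density estimator, the paper may normalize the measure differently, and the function name is hypothetical.

```python
import math

def mutual_dependence_discrete(joint):
    """Bhattacharyya distance between a joint pmf table and the product of its marginals.

    `joint` is a list of rows whose entries are non-negative and sum to 1.
    """
    px = [sum(row) for row in joint]          # marginal over rows
    py = [sum(col) for col in zip(*joint)]    # marginal over columns
    # Bhattacharyya coefficient: sum over cells of sqrt(p_xy * p_x * p_y).
    bc = sum(math.sqrt(joint[i][j] * px[i] * py[j])
             for i in range(len(px)) for j in range(len(py)))
    return -math.log(bc)

# Two hypothetical 2x2 tables: independent variables, and perfectly dependent ones.
independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]
d_independent = mutual_dependence_discrete(independent)
d_dependent = mutual_dependence_discrete(dependent)
```

The distance is zero when the joint table equals the product of its marginals and grows as the two diverge, which is the behaviour one would want from a dependence measure.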


Automatic Inference for Inverting Software Simulators via Probabilistic Programming

arXiv.org Machine Learning

Models of complex systems are often formalized as sequential software simulators: computationally intensive programs that iteratively build up probable system configurations given parameters and initial conditions. These simulators enable modelers to capture effects that are difficult to characterize analytically or summarize statistically. However, in many real-world applications, these simulations need to be inverted to match the observed data. This typically requires the custom design, derivation and implementation of sophisticated inversion algorithms. Here we give a framework for inverting a broad class of complex software simulators via probabilistic programming and automatic inference, using under 20 lines of probabilistic code. Our approach is based on a formulation of inversion as approximate inference in a simple sequential probabilistic model. We implement four inference strategies, including Metropolis-Hastings, a sequentialized Metropolis-Hastings scheme, and a particle Markov chain Monte Carlo scheme, requiring 4 or fewer lines of probabilistic code each. We demonstrate our framework by applying it to invert a real geological software simulator from the oil and gas industry.


Robust PCA: Optimization of the Robust Reconstruction Error over the Stiefel Manifold

arXiv.org Machine Learning

It is well known that Principal Component Analysis (PCA) is strongly affected by outliers, and considerable effort has been put into the robustification of PCA. In this paper we present a new algorithm for robust PCA minimizing the trimmed reconstruction error. By directly minimizing over the Stiefel manifold, we avoid deflation as often used by projection pursuit methods. In contrast to other methods for robust PCA, our method has no free parameter and is computationally very efficient. We illustrate the performance on various datasets including an application to background modeling and subtraction. Our method performs better than or similar to current state-of-the-art methods while being faster.


Proximal Algorithms in Statistics and Machine Learning

arXiv.org Machine Learning

In this paper we develop proximal methods for statistical learning. Proximal point algorithms are useful in statistics and machine learning for obtaining optimization solutions for composite functions. Our approach exploits closed-form solutions of proximal operators and envelope representations based on the Moreau, Forward-Backward, Douglas-Rachford and Half-Quadratic envelopes. Envelope representations lead to novel proximal algorithms for statistical optimisation of composite objective functions which include both non-smooth and non-convex objectives. We illustrate our methodology with regularized Logistic and Poisson regression and non-convex bridge penalties with a fused lasso norm. We provide a discussion of convergence of non-descent algorithms with acceleration and for non-convex functions. Finally, we provide directions for future research.
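
A small sketch may help with the idea of a closed-form proximal operator inside a forward-backward (proximal gradient) iteration. It uses the well-known soft-thresholding operator, which is the proximal operator of the scaled l1 norm, on a toy denoising objective. The objective, step size, and function names are illustrative assumptions and do not reproduce the paper's examples.

```python
def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return [max(abs(vi) - t, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]

def proximal_gradient(grad_f, prox_g, x0, step, iters):
    """Forward-backward splitting: a gradient step on f, then a prox step on g."""
    x = list(x0)
    for _ in range(iters):
        g = grad_f(x)
        forward = [xi - step * gi for xi, gi in zip(x, g)]  # forward (gradient) step
        x = prox_g(forward, step)                           # backward (proximal) step
    return x

# Toy problem: minimize 0.5 * ||x - b||^2 + lam * ||x||_1.
# Its exact minimizer is soft_threshold(b, lam), which lets us check the iteration.
b = [3.0, -0.5, 1.5]
lam = 1.0
x_hat = proximal_gradient(
    grad_f=lambda x: [xi - bi for xi, bi in zip(x, b)],
    prox_g=lambda v, step: soft_threshold(v, step * lam),
    x0=[0.0, 0.0, 0.0],
    step=0.5,
    iters=100,
)
```

Keeping `grad_f` and `prox_g` as separate arguments mirrors the composite structure (smooth term plus non-smooth penalty) that proximal methods exploit; swapping in a different penalty only requires a different `prox_g`.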


On the Computational Complexity of High-Dimensional Bayesian Variable Selection

arXiv.org Machine Learning

We study the computational complexity of Markov chain Monte Carlo (MCMC) methods for high-dimensional Bayesian linear regression under sparsity constraints. We first show that a Bayesian approach can achieve variable-selection consistency under relatively mild conditions on the design matrix. We then demonstrate that the statistical criterion of posterior concentration need not imply the computational desideratum of rapid mixing of the MCMC algorithm. By introducing a truncated sparsity prior for variable selection, we provide a set of conditions that guarantee both variable-selection consistency and rapid mixing of a particular Metropolis-Hastings algorithm. The mixing time is linear in the number of covariates up to a logarithmic factor. Our proof controls the spectral gap of the Markov chain by constructing a canonical path ensemble that is inspired by the steps taken by greedy algorithms for variable selection.


Signal Recovery on Graphs: Variation Minimization

arXiv.org Machine Learning

We consider the problem of signal recovery on graphs, where graphs model data with complex structure as signals on a graph. Graph signal recovery entails recovering one or multiple smooth graph signals from noisy, corrupted, or incomplete measurements. We propose a graph signal model and formulate signal recovery as a corresponding optimization problem. We provide a general solution by using the alternating direction method of multipliers. We next show how signal inpainting, matrix completion, robust principal component analysis, and anomaly detection all relate to graph signal recovery, and provide corresponding specific solutions and theoretical analysis. Finally, we validate the proposed methods on real-world recovery problems, including online blog classification, bridge condition identification, temperature estimation, recommender system, and expert opinion combination of online blog classification.


Spectral MLE: Top-$K$ Rank Aggregation from Pairwise Comparisons

arXiv.org Machine Learning

This paper explores the preference-based top-$K$ rank aggregation problem. Suppose that a collection of items is repeatedly compared in pairs, and one wishes to recover a consistent ordering that emphasizes the top-$K$ ranked items, based on partially revealed preferences. We focus on the Bradley-Terry-Luce (BTL) model that postulates a set of latent preference scores underlying all items, where the odds of paired comparisons depend only on the relative scores of the items involved. We characterize the minimax limits on identifiability of top-$K$ ranked items, in the presence of random and non-adaptive sampling. Our results highlight a separation measure that quantifies the gap of preference scores between the $K^{\text{th}}$ and $(K+1)^{\text{th}}$ ranked items. The minimum sample complexity required for reliable top-$K$ ranking scales inversely with the separation measure irrespective of other preference distribution metrics. To approach this minimax limit, we propose a nearly linear-time ranking scheme, called Spectral MLE, that returns the indices of the top-$K$ items in accordance with a careful score estimate. In a nutshell, Spectral MLE starts with an initial score estimate with minimal squared loss (obtained via a spectral method), and then successively refines each component with the assistance of coordinate-wise MLEs. Encouragingly, Spectral MLE allows perfect top-$K$ item identification under minimal sample complexity. The practical applicability of Spectral MLE is further corroborated by numerical experiments.


Compositional Vector Space Models for Knowledge Base Completion

arXiv.org Machine Learning

Knowledge base (KB) completion adds new facts to a KB by making inferences from existing facts, for example by inferring with high likelihood nationality(X,Y) from bornIn(X,Y). Most previous methods infer simple one-hop relational synonyms like this, or use as evidence a multi-hop relational path treated as an atomic feature, like bornIn(X,Z) -> containedIn(Z,Y). This paper presents an approach that reasons about conjunctions of multi-hop relations non-atomically, composing the implications of a path using a recursive neural network (RNN) that takes as inputs vector embeddings of the binary relation in the path. Not only does this allow us to generalize to paths unseen at training time, but also, with a single high-capacity RNN, to predict new relation types not seen when the compositional model was trained (zero-shot learning). We assemble a new dataset of over 52M relational triples, and show that our method improves over a traditional classifier by 11%, and a method leveraging pre-trained embeddings by 7%.