Goto

Collaborating Authors

 Learning Graphical Models


Markov Chain Truncation for Doubly-Intractable Inference

arXiv.org Machine Learning

Computing partition functions, the normalizing constants of probability distributions, is often hard. Variants of importance sampling give unbiased estimates of a normalizer Z, however, unbiased estimates of the reciprocal 1/Z are harder to obtain. Unbiased estimates of 1/Z allow Markov chain Monte Carlo sampling of "doubly-intractable" distributions, such as the parameter posterior for Markov Random Fields or Exponential Random Graphs. We demonstrate how to construct unbiased estimates for 1/Z given access to black-box importance sampling estimators for Z. We adapt recent work on random series truncation and Markov chain coupling, producing estimators with lower variance and a higher percentage of positive estimates than before. Our debiasing algorithms are simple to implement, and have some theoretical and empirical advantages over existing methods.


Why is Differential Evolution Better than Grid Search for Tuning Defect Predictors?

arXiv.org Machine Learning

Context: One of the black arts of data mining is learning the magic parameters which control the learners. In software analytics, at least for defect prediction, several methods, like grid search and differential evolution (DE), have been proposed to learn these parameters, which has been proved to be able to improve the performance scores of learners. Objective: We want to evaluate which method can find better parameters in terms of performance score and runtime cost. Methods: This paper compares grid search to differential evolution, which is an evolutionary algorithm that makes extensive use of stochastic jumps around the search space. Results: We find that the seemingly complete approach of grid search does no better, and sometimes worse, than the stochastic search. When repeated 20 times to check for conclusion validity, DE was over 210 times faster than grid search to tune Random Forests on 17 testing data sets with F-Measure Conclusions: These results are puzzling: why does a quick partial search be just as effective as a much slower, and much more, extensive search? To answer that question, we turned to the theoretical optimization literature. Bergstra and Bengio conjecture that grid search is not more effective than more randomized searchers if the underlying search space is inherently low dimensional. This is significant since recent results show that defect prediction exhibits very low intrinsic dimensionality-- an observation that explains why a fast method like DE may work as well as a seemingly more thorough grid search. This suggests, as a future research direction, that it might be possible to peek at data sets before doing any optimization in order to match the optimization algorithm to the problem at hand.


Markov Chain Lifting and Distributed ADMM

arXiv.org Machine Learning

The time to converge to the steady state of a finite Markov chain can be greatly reduced by a lifting operation, which creates a new Markov chain on an expanded state space. For a class of quadratic objectives, we show an analogous behavior where a distributed ADMM algorithm can be seen as a lifting of Gradient Descent algorithm. This provides a deep insight for its faster convergence rate under optimal parameter tuning. We conjecture that this gain is always present, as opposed to the lifting of a Markov chain which sometimes only provides a marginal speedup.


High SNR Consistent Compressive Sensing

arXiv.org Machine Learning

High signal to noise ratio (SNR) consistency of model selection criteria in linear regression models has attracted a lot of attention recently. However, most of the existing literature on high SNR consistency deals with model order selection. Further, the limited literature available on the high SNR consistency of subset selection procedures (SSPs) is applicable to linear regression with full rank measurement matrices only. Hence, the performance of SSPs used in underdetermined linear models (a.k.a compressive sensing (CS) algorithms) at high SNR is largely unknown. This paper fills this gap by deriving necessary and sufficient conditions for the high SNR consistency of popular CS algorithms like $l_0$-minimization, basis pursuit de-noising or LASSO, orthogonal matching pursuit and Dantzig selector. Necessary conditions analytically establish the high SNR inconsistency of CS algorithms when used with the tuning parameters discussed in literature. Novel tuning parameters with SNR adaptations are developed using the sufficient conditions and the choice of SNR adaptations are discussed analytically using convergence rate analysis. CS algorithms with the proposed tuning parameters are numerically shown to be high SNR consistent and outperform existing tuning parameters in the moderate to high SNR regime.


The best kept secret about linear and logistic regression

@machinelearnbot

All the regression theory developed by statisticians over the last 200 years (related to the general linear model) is useless. Regression can be performed as accurately without statistical models, including the computation of confidence intervals (for estimates, predicted values or regression parameters). The non-statistical approach is also more robust than theory described in all statistics textbooks and taught in all statistical courses. It does not require Map-Reduce when data is really big, nor any matrix inversion, maximum likelihood estimation, or mathematical optimization (Newton algorithm). It is indeed incredibly simple, robust, easy to interpret, and easy to code (no statistical libraries required).


Parallel Markov Chain Monte Carlo for the Indian Buffet Process

arXiv.org Machine Learning

Indian Buffet Process based models are an elegant way for discovering underlying features within a data set, but inference in such models can be slow. Inferring underlying features using Markov chain Monte Carlo either relies on an uncollapsed representation, which leads to poor mixing, or on a collapsed representation, which leads to a quadratic increase in computational complexity. Existing attempts at distributing inference have introduced additional approximation within the inference procedure. In this paper we present a novel algorithm to perform asymptotically exact parallel Markov chain Monte Carlo inference for Indian Buffet Process models. We take advantage of the fact that the features are conditionally independent under the beta-Bernoulli process. Because of this conditional independence, we can partition the features into two parts: one part containing only the finitely many instantiated features and the other part containing the infinite tail of uninstantiated features. For the finite partition, parallel inference is simple given the instantiation of features. But for the infinite tail, performing uncollapsed MCMC leads to poor mixing and hence we collapse out the features. The resulting hybrid sampler, while being parallel, produces samples asymptotically from the true posterior.


Sample Efficient Feature Selection for Factored MDPs

arXiv.org Machine Learning

In reinforcement learning, the state of the real world is often represented by feature vectors. However, not all of the features may be pertinent for solving the current task. We propose Feature Selection Explore and Exploit (FS-EE), an algorithm that automatically selects the necessary features while learning a Factored Markov Decision Process, and prove that under mild assumptions, its sample complexity scales with the in-degree of the dynamics of just the necessary features, rather than the in-degree of all features. This can result in a much better sample complexity when the in-degree of the necessary features is smaller than the in-degree of all features.


Identifying Best Interventions through Online Importance Sampling

arXiv.org Machine Learning

Motivated by applications in computational advertising and systems biology, we consider the problem of identifying the best out of several possible soft interventions at a source node $V$ in an acyclic causal directed graph, to maximize the expected value of a target node $Y$ (located downstream of $V$). Our setting imposes a fixed total budget for sampling under various interventions, along with cost constraints on different types of interventions. We pose this as a best arm identification bandit problem with $K$ arms where each arm is a soft intervention at $V,$ and leverage the information leakage among the arms to provide the first gap dependent error and simple regret bounds for this problem. Our results are a significant improvement over the traditional best arm identification results. We empirically show that our algorithms outperform the state of the art in the Flow Cytometry data-set, and also apply our algorithm for model interpretation of the Inception-v3 deep net that classifies images.


Parallel Implementation of Efficient Search Schemes for the Inference of Cancer Progression Models

arXiv.org Machine Learning

The emergence and development of cancer is a consequence of the accumulation over time of genomic mutations involving a specific set of genes, which provides the cancer clones with a functional selective advantage. In this work, we model the order of accumulation of such mutations during the progression, which eventually leads to the disease, by means of probabilistic graphic models, i.e., Bayesian Networks (BNs). We investigate how to perform the task of learning the structure of such BNs, according to experimental evidence, adopting a global optimization meta-heuristics. In particular, in this work we rely on Genetic Algorithms, and to strongly reduce the execution time of the inference -- which can also involve multiple repetitions to collect statistically significant assessments of the data -- we distribute the calculations using both multi-threading and a multi-node architecture. The results show that our approach is characterized by good accuracy and specificity; we also demonstrate its feasibility, thanks to a 84x reduction of the overall execution time with respect to a traditional sequential implementation.


Dropout Inference in Bayesian Neural Networks with Alpha-divergences

arXiv.org Machine Learning

To obtain uncertainty estimates with real-world Bayesian deep learning models, practical inference approximations are needed. Dropout variational inference (VI) for example has been used for machine vision and medical applications, but VI can severely underestimates model uncertainty. Alpha-divergences are alternative divergences to VI's KL objective, which are able to avoid VI's uncertainty underestimation. But these are hard to use in practice: existing techniques can only use Gaussian approximating distributions, and require existing models to be changed radically, thus are of limited use for practitioners. We propose a re-parametrisation of the alpha-divergence objectives, deriving a simple inference technique which, together with dropout, can be easily implemented with existing models by simply changing the loss of the model. We demonstrate improved uncertainty estimates and accuracy compared to VI in dropout networks. We study our model's epistemic uncertainty far away from the data using adversarial images, showing that these can be distinguished from non-adversarial images by examining our model's uncertainty.