Goto

Collaborating Authors

 Uncertainty


Combining predictions from linear models when training and test inputs differ

arXiv.org Machine Learning

Methods for combining predictions from different models in a supervised learning setting must somehow estimate/predict the quality of a model's predictions at unknown future inputs. Many of these methods (often implicitly) make the assumption that the test inputs are identical to the training inputs, which is seldom reasonable. By failing to take into account that prediction will generally be harder for test inputs that did not occur in the training set, this leads to the selection of too complex models. Based on a novel, unbiased expression for KL divergence, we propose XAIC and its special case FAIC as versions of AIC intended for prediction that use different degrees of knowledge of the test inputs. Both methods substantially differ from and may outperform all the known versions of AIC even when the training and test inputs are iid, and are especially useful for deterministic inputs and under covariate shift. Our experiments on linear models suggest that if the test and training inputs differ substantially, then XAIC and FAIC predictively outperform AIC, BIC and several other methods including Bayesian model averaging.


Divide-and-Conquer Learning by Anchoring a Conical Hull

arXiv.org Machine Learning

We reduce a broad class of machine learning problems, usually addressed by EM or sampling, to the problem of finding the $k$ extremal rays spanning the conical hull of a data point set. These $k$ "anchors" lead to a global solution and a more interpretable model that can even outperform EM and sampling on generalization error. To find the $k$ anchors, we propose a novel divide-and-conquer learning scheme "DCA" that distributes the problem to $\mathcal O(k\log k)$ same-type sub-problems on different low-D random hyperplanes, each can be solved by any solver. For the 2D sub-problem, we present a non-iterative solver that only needs to compute an array of cosine values and its max/min entries. DCA also provides a faster subroutine for other methods to check whether a point is covered in a conical hull, which improves algorithm design in multiple dimensions and brings significant speedup to learning. We apply our method to GMM, HMM, LDA, NMF and subspace clustering, then show its competitive performance and scalability over other methods on rich datasets.


Minimax-optimal Inference from Partial Rankings

arXiv.org Machine Learning

This paper studies the problem of inferring a global preference based on the partial rankings provided by many users over different subsets of items according to the Plackett-Luce model. A question of particular interest is how to optimally assign items to users for ranking and how many item assignments are needed to achieve a target estimation error. For a given assignment of items to users, we first derive an oracle lower bound of the estimation error that holds even for the more general Thurstone models. Then we show that the Cram\'er-Rao lower bound and our upper bounds inversely depend on the spectral gap of the Laplacian of an appropriately defined comparison graph. When the system is allowed to choose the item assignment, we propose a random assignment scheme. Our oracle lower bound and upper bounds imply that it is minimax-optimal up to a logarithmic factor among all assignment schemes and the lower bound can be achieved by the maximum likelihood estimator as well as popular rank-breaking schemes that decompose partial rankings into pairwise comparisons. The numerical experiments corroborate our theoretical findings.


Structured Generative Models of Natural Source Code

arXiv.org Machine Learning

We study the problem of building generative models of natural source code (NSC); that is, source code written and understood by humans. Our primary contribution is to describe a family of generative models for NSC that have three key properties: First, they incorporate both sequential and hierarchical structure. Second, we learn a distributed representation of source code elements. Finally, they integrate closely with a compiler, which allows leveraging compiler logic and abstractions when building structure into the model. We also develop an extension that includes more complex structure, refining how the model generates identifier tokens based on what variables are currently in scope. Our models can be learned efficiently, and we show empirically that including appropriate structure greatly improves the models, measured by the probability of generating test programs.


Inferring causal structure: a quantum advantage

arXiv.org Machine Learning

These authors contributed equally to this work. The real surprise, however, is that even if one only has the ability to passively observe the early system, the quantum correlations hold signatures of the causal structure--in other words, certain types of correlation do imply causation. In a recent paper, Fitzsimons, Jones and Vedral [3] defined a function of the observed correlations which acts as a witness of direct causal influence, by ruling out a purely common-cause explanation. We here present the larger framework that places this result on an equal footing with an analogous result for common-cause relations. The problem of using observed correlations to infer causal relations is relevant to a wide variety of scientific disciplines. Yet given correlations between just two classical variables, it is impossible to determine whether they arose from a causal influence of one on the other or a common cause influencing both, unless one can implement a randomized intervention. We here consider the problem of causal inference for quantum variables. We introduce causal tomography, which unifies and generalizes conventional quantum tomography schemes to provide a complete solution to the causal inference problem using a quantum analogue of a randomized trial. We furthermore show that, in contrast to the classical case, observed quantum correlations alone can sometimes provide a solution. We implement a quantum-optical experiment that allows us to control the causal relation between two optical modes, and two measurement schemes--one with and one without randomization-- that extract this relation from the observed correlations.


Lifted Tree-Reweighted Variational Inference

arXiv.org Artificial Intelligence

We analyze variational inference for highly symmetric graphical models such as those arising from first-order probabilistic models. We first show that for these graphical models, the tree-reweighted variational objective lends itself to a compact lifted formulation which can be solved much more efficiently than the standard TRW formulation for the ground graphical model. Compared to earlier work on lifted belief propagation, our formulation leads to a convex optimization problem for lifted marginal inference and provides an upper bound on the partition function. We provide two approaches for improving the lifted TRW upper bound. The first is a method for efficiently computing maximum spanning trees in highly symmetric graphs, which can be used to optimize the TRW edge appearance probabilities. The second is a method for tightening the relaxation of the marginal polytope using lifted cycle inequalities and novel exchangeable cluster consistency constraints.


Investigation of commuting Hamiltonian in quantum Markov network

arXiv.org Artificial Intelligence

Noname manuscript No. (will be inserted by the editor) Abstract Graphical Models have various applications in science and engineering which include physics, bioinformatics, telecommunication and etc. Usage of graphical models needs complex computations in order to evaluation of marginal functions, so there are some powerful methods including mean field approximation, belief propagation algorithm and etc. Quantum graphical models have been recently developed in context of quantum information and computation, and quantum statistical physics, which is possible by generalization of classical probability theory to quantum theory. The main goal of this paper is preparing a primary generalization of Markov network, as a type of graphical models, to quantum case and applying in quantum statistical physics. We have investigated the Markov network and the role of commuting Hamiltonian terms in conditional independence with simple examples of quantum statistical physics. Keywords Graphical models ยท Quantum graphical models ยท Conditional independence ยท Quantum conditional independence ยท Commuting Hamiltonian ยท Quantum Markov network 1 Introduction In 1988, Pearl devised belief propagation algorithm to solve marginalization and other inference problems.


Guarantees and Limits of Preprocessing in Constraint Satisfaction and Reasoning

arXiv.org Artificial Intelligence

We present a first theoretical analysis of the power of polynomial-time preprocessing for important combinatorial problems from various areas in AI. We consider problems from Constraint Satisfaction, Global Constraints, Satisfiability, Nonmonotonic and Bayesian Reasoning under structural restrictions. All these problems involve two tasks: (i) identifying the structure in the input as required by the restriction, and (ii) using the identified structure to solve the reasoning task efficiently. We show that for most of the considered problems, task (i) admits a polynomial-time preprocessing to a problem kernel whose size is polynomial in a structural problem parameter of the input, in contrast to task (ii) which does not admit such a reduction to a problem kernel of polynomial size, subject to a complexity theoretic assumption. As a notable exception we show that the consistency problem for the AtMost-NValue constraint admits a polynomial kernel consisting of a quadratic number of variables and domain values. Our results provide a firm worst-case guarantees and theoretical boundaries for the performance of polynomial-time preprocessing algorithms for the considered problems.


Input Warping for Bayesian Optimization of Non-stationary Functions

arXiv.org Machine Learning

Bayesian optimization has proven to be a highly effective methodology for the global optimization of unknown, expensive and multimodal functions. The ability to accurately model distributions over functions is critical to the effectiveness of Bayesian optimization. Although Gaussian processes provide a flexible prior over functions which can be queried efficiently, there are various classes of functions that remain difficult to model. One of the most frequently occurring of these is the class of non-stationary functions. The optimization of the hyperparameters of machine learning algorithms is a problem domain in which parameters are often manually transformed a priori, for example by optimizing in "log-space," to mitigate the effects of spatially-varying length scale. We develop a methodology for automatically learning a wide family of bijective transformations or warpings of the input space using the Beta cumulative distribution function. We further extend the warping framework to multi-task Bayesian optimization so that multiple tasks can be warped into a jointly stationary space. On a set of challenging benchmark optimization tasks, we observe that the inclusion of warping greatly improves on the state-of-the-art, producing better results faster and more reliably.


Learning Latent Variable Gaussian Graphical Models

arXiv.org Machine Learning

Gaussian graphical models (GGM) have been widely used in many high-dimensional applications ranging from biological and financial data to recommender systems. Sparsity in GGM plays a central role both statistically and computationally. Unfortunately, real-world data often does not fit well to sparse graphical models. In this paper, we focus on a family of latent variable Gaussian graphical models (LVGGM), where the model is conditionally sparse given latent variables, but marginally non-sparse. In LVGGM, the inverse covariance matrix has a low-rank plus sparse structure, and can be learned in a regularized maximum likelihood framework. We derive novel parameter estimation error bounds for LVGGM under mild conditions in the high-dimensional setting. These results complement the existing theory on the structural learning, and open up new possibilities of using LVGGM for statistical inference.