Goto

Collaborating Authors

 Genre


Structured Priors for Structure Learning

arXiv.org Artificial Intelligence

Traditional approaches to Bayes net structure learning typically assume little regularity in graph structure other than sparseness. However, in many cases, we expect more systematicity: variables in real-world systems often group into classes that predict the kinds of probabilistic dependencies they participate in. Here we capture this form of prior knowledge in a hierarchical Bayesian framework, and exploit it to enable structure learning and type discovery from small datasets. Specifically, we present a nonparametric generative model for directed acyclic graphs as a prior for Bayes net structure learning. Our model assumes that variables come in one or more classes and that the prior probability of an edge existing between two variables is a function only of their classes. We derive an MCMC algorithm for simultaneous inference of the number of classes, the class assignments of variables, and the Bayes net structure over variables. For several realistic, sparse datasets, we show that the bias towards systematicity of connections provided by our model yields more accurate learned networks than a traditional, uniform prior approach, and that the classes found by our model are appropriate.


Output Space Search for Structured Prediction

arXiv.org Artificial Intelligence

We consider a framework for structured prediction based on search in the space of complete structured outputs. Given a structured input, an output is produced by running a time-bounded search procedure guided by a learned cost function, and then returning the least cost output uncovered during the search. This framework can be instantiated for a wide range of search spaces and search procedures, and easily incorporates arbitrary structured-prediction loss functions. In this paper, we make two main technical contributions. First, we define the limited-discrepancy search space over structured outputs, which is able to leverage powerful classification learning algorithms to improve the search space quality. Second, we give a generic cost function learning approach, where the key idea is to learn a cost function that attempts to mimic the behavior of conducting searches guided by the true loss function. Our experiments on six benchmark domains demonstrate that using our framework with only a small amount of search is sufficient for significantly improving on state-of-the-art structured-prediction performance.


Variable noise and dimensionality reduction for sparse Gaussian processes

arXiv.org Machine Learning

The sparse pseudo-input Gaussian process (SPGP) is a new approximation method for speeding up GP regression in the case of a large number of data points N. The approximation is controlled by the gradient optimization of a small set of M `pseudo-inputs', thereby reducing complexity from N^3 to NM^2. One limitation of the SPGP is that this optimization space becomes impractically big for high dimensional data sets. This paper addresses this limitation by performing automatic dimensionality reduction. A projection of the input space to a low dimensional space is learned in a supervised manner, alongside the pseudo-inputs, which now live in this reduced space. The paper also investigates the suitability of the SPGP for modeling data with input-dependent noise. A further extension of the model is made to make it even more powerful in this regard - we learn an uncertainty parameter for each pseudo-input. The combination of sparsity, reduced dimension, and input-dependent noise makes it possible to apply GPs to much larger and more complex data sets than was previously practical. We demonstrate the benefits of these methods on several synthetic and real world problems.


Ranking by Dependence - A Fair Criteria

arXiv.org Machine Learning

Estimating the dependences between random variables, and ranking them accordingly, is a prevalent problem in machine learning. Pursuing frequentist and information-theoretic approaches, we first show that the p-value and the mutual information can fail even in simplistic situations. We then propose two conditions for regularizing an estimator of dependence, which leads to a simple yet effective new measure. We discuss its advantages and compare it to well-established model-selection criteria. Apart from that, we derive a simple constraint for regularizing parameter estimates in a graphical model. This results in an analytical approximation for the optimal value of the equivalent sample size, which agrees very well with the more involved Bayesian approach in our experiments.


Bayesian Random Fields: The Bethe-Laplace Approximation

arXiv.org Machine Learning

While learning the maximum likelihood value of parameters of an undirected graphical model is hard, modelling the posterior distribution over parameters given data is harder. Yet, undirected models are ubiquitous in computer vision and text modelling (e.g. conditional random fields). But where Bayesian approaches for directed models have been very successful, a proper Bayesian treatment of undirected models in still in its infant stages. We propose a new method for approximating the posterior of the parameters given data based on the Laplace approximation. This approximation requires the computation of the covariance matrix over features which we compute using the linear response approximation based in turn on loopy belief propagation. We develop the theory for conditional and 'unconditional' random fields with or without hidden variables. In the conditional setting we introduce a new variant of bagging suitable for structured domains. Here we run the loopy max-product algorithm on a 'super-graph' composed of graphs for individual models sampled from the posterior and connected by constraints. Experiments on real world data validate the proposed methods.


Bayesian Multicategory Support Vector Machines

arXiv.org Machine Learning

We show that the multi-class support vector machine (MSVM) proposed by Lee et al. (2004) can be viewed as a MAP estimation procedure under an appropriate probabilistic interpretation of the classifier. We also show that this interpretation can be extended to a hierarchical Bayesian architecture and to a fully-Bayesian inference procedure for multiclass classification based on data augmentation. We present empirical results that show that the advantages of the Bayesian formalism are obtained without a loss in classification accuracy.


Predicting Conditional Quantiles via Reduction to Classification

arXiv.org Machine Learning

We show how to reduce the process of predicting general order statistics (and the median in particular) to solving classification. The accompanying theoretical statement shows that the regret of the classifier bounds the regret of the quantile regression under a quantile loss. We also test this reduction empirically against existing quantile regression methods on large real-world datasets and discover that it provides state-of-the-art performance.


Faster Gaussian Summation: Theory and Experiment

arXiv.org Machine Learning

We provide faster algorithms for the problem of Gaussian summation, which occurs in many machine learning methods. We develop two new extensions - an O(Dp) Taylor expansion for the Gaussian kernel with rigorous error bounds and a new error control scheme integrating any arbitrary approximation method - within the best discretealgorithmic framework using adaptive hierarchical data structures. We rigorously evaluate these techniques empirically in the context of optimal bandwidth selection in kernel density estimation, revealing the strengths and weaknesses of current state-of-the-art approaches for the first time. Our results demonstrate that the new error control scheme yields improved performance, whereas the series expansion approach is only effective in low dimensions (five or less).


Gibbs Sampling for (Coupled) Infinite Mixture Models in the Stick Breaking Representation

arXiv.org Machine Learning

Nonparametric Bayesian approaches to clustering, information retrieval, language modeling and object recognition have recently shown great promise as a new paradigm for unsupervised data analysis. Most contributions have focused on the Dirichlet process mixture models or extensions thereof for which efficient Gibbs samplers exist. In this paper we explore Gibbs samplers for infinite complexity mixture models in the stick breaking representation. The advantage of this representation is improved modeling flexibility. For instance, one can design the prior distribution over cluster sizes or couple multiple infinite mixture models (e.g. over time) at the level of their parameters (i.e. the dependent Dirichlet process model). However, Gibbs samplers for infinite mixture models (as recently introduced in the statistics literature) seem to mix poorly over cluster labels. Among others issues, this can have the adverse effect that labels for the same cluster in coupled mixture models are mixed up. We introduce additional moves in these samplers to improve mixing over cluster labels and to bring clusters into correspondence. An application to modeling of storm trajectories is used to illustrate these ideas.


Matrix Tile Analysis

arXiv.org Machine Learning

Many tasks require finding groups of elements in a matrix of numbers, symbols or class likelihoods. One approach is to use efficient bi- or tri-linear factorization techniques including PCA, ICA, sparse matrix factorization and plaid analysis. These techniques are not appropriate when addition and multiplication of matrix elements are not sensibly defined. More directly, methods like bi-clustering can be used to classify matrix elements, but these methods make the overly-restrictive assumption that the class of each element is a function of a row class and a column class. We introduce a general computational problem, `matrix tile analysis' (MTA), which consists of decomposing a matrix into a set of non-overlapping tiles, each of which is defined by a subset of usually nonadjacent rows and columns. MTA does not require an algebra for combining tiles, but must search over discrete combinations of tile assignments. Exact MTA is a computationally intractable integer programming problem, but we describe an approximate iterative technique and a computationally efficient sum-product relaxation of the integer program. We compare the effectiveness of these methods to PCA and plaid on hundreds of randomly generated tasks. Using double-gene-knockout data, we show that MTA finds groups of interacting yeast genes that have biologically-related functions.