Goto

Collaborating Authors

 Uncertainty


Mixture Approximations to Bayesian Networks

arXiv.org Artificial Intelligence

Structure and parameters in a Bayesian network uniquely specify the probability distribution of the modeled domain. The locality of both structure and probabilistic information are the great benefits of Bayesian networks and require the modeler to only specify local information. On the other hand this locality of information might prevent the modeler - and even more any other person - from obtaining a general overview of the important relationships within the domain. The goal of the work presented in this paper is to provide an "alternative" view on the knowledge encoded in a Bayesian network which might sometimes be very helpful for providing insights into the underlying domain. The basic idea is to calculate a mixture approximation to the probability distribution represented by the Bayesian network. The mixture component densities can be thought of as representing typical scenarios implied by the Bayesian model, providing intuition about the basic relationships. As an additional benefit, performing inference in the approximate model is very simple and intuitive and can provide additional insights. The computational complexity for the calculation of the mixture approximations criticaly depends on the measure which defines the distance between the probability distribution represented by the Bayesian network and the approximate distribution. Both the KL-divergence and the backward KL-divergence lead to inefficient algorithms. Incidentally, the latter is used in recent work on mixtures of mean field solutions to which the work presented here is closely related. We show, however, that using a mean squared error cost function leads to update equations which can be solved using the junction tree algorithm. We conclude that the mean squared error cost function can be used for Bayesian networks in which inference based on the junction tree is tractable. For large networks, however, one may have to rely on mean field approximations.


Relative Loss Bounds for On-line Density Estimation with the Exponential Family of Distributions

arXiv.org Machine Learning

We consider on-line density estimation with a parameterized density from the exponential family. The on-line algorithm receives one example at a time and maintains a parameter that is essentially an average of the past examples. After receiving an example the algorithm incurs a loss which is the negative log-likelihood of the example w.r.t. the past parameter of the algorithm. An off-line algorithm can choose the best parameter based on all the examples. We prove bounds on the additional total loss of the on-line algorithm over the total loss of the off-line algorithm. These relative loss bounds hold for an arbitrary sequence of examples. The goal is to design algorithms with the best possible relative loss bounds. We use a certain divergence to derive and analyze the algorithms. This divergence is a relative entropy between two exponential distributions.


On Supervised Selection of Bayesian Networks

arXiv.org Machine Learning

Given a set of possible models (e.g., Bayesian network structures) and a data sample, in the unsupervised model selection problem the task is to choose the most accurate model with respect to the domain joint probability distribution. In contrast to this, in supervised model selection it is a priori known that the chosen model will be used in the future for prediction tasks involving more "focused" predictive distributions. Although focused predictive distributions can be produced from the joint probability distribution by marginalization, in practice the best model in the unsupervised sense does not necessarily perform well in supervised domains. In particular, the standard marginal likelihood score is a criterion for the unsupervised task, and, although frequently used for supervised model selection also, does not perform well in such tasks. In this paper we study the performance of the marginal likelihood score empirically in supervised Bayesian network selection tasks by using a large number of publicly available classification data sets, and compare the results to those obtained by alternative model selection criteria, including empirical crossvalidation methods, an approximation of a supervised marginal likelihood measure, and a supervised version of Dawid's prequential (predictive sequential) principle. The results demonstrate that the marginal likelihood score does not perform well for supervised model selection, while the best results are obtained by using Dawid's prequential approach.


Approximate Learning in Complex Dynamic Bayesian Networks

arXiv.org Machine Learning

In this paper we extend the work of Smith and Papamichail (1999) and present fast approximate Bayesian algorithms for learning in complex scenarios where at any time frame, the relationships between explanatory state space variables can be described by a Bayesian network that evolve dynamically over time and the observations taken are not necessarily Gaussian. It uses recent developments in approximate Bayesian forecasting methods in combination with more familiar Gaussian propagation algorithms on junction trees. The procedure for learning state parameters from data is given explicitly for common sampling distributions and the methodology is illustrated through a real application. The efficiency of the dynamic approximation is explored by using the Hellinger divergence measure and theoretical bounds for the efficacy of such a procedure are discussed.


Accelerating EM: An Empirical Study

arXiv.org Machine Learning

Many applications require that we learn the parameters of a model from data. EM is a method used to learn the parameters of probabilistic models for which the data for some of the variables in the models is either missing or hidden. There are instances in which this method is slow to converge. Therefore, several accelerations have been proposed to improve the method. None of the proposed acceleration methods are theoretically dominant and experimental comparisons are lacking. In this paper, we present the different proposed accelerations and try to compare them experimentally. From the results of the experiments, we argue that some acceleration of EM is always possible, but that which acceleration is superior depends on properties of the problem.


Probabilistic Latent Semantic Analysis

arXiv.org Machine Learning

Probabilistic Latent Semantic Analysis is a novel statistical technique for the analysis of two-mode and co-occurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and in related areas. Compared to standard Latent Semantic Analysis which stems from linear algebra and performs a Singular Value Decomposition of co-occurrence tables, the proposed method is based on a mixture decomposition derived from a latent class model. This results. in a more principled approach which has a solid foundation in statistics. In order to avoid overfitting, we propose a widely applicable generalization of maximum likelihood model fitting by tempered EM.


Inferring Parameters and Structure of Latent Variable Models by Variational Bayes

arXiv.org Machine Learning

Current methods for learning graphical models with latent variables and a fixed structure estimate optimal values for the model parameters. Whereas this approach usually produces overfitting and suboptimal generalization performance, carrying out the Bayesian program of computing the full posterior distributions over the parameters remains a difficult problem. Moreover, learning the structure of models with latent variables, for which the Bayesian approach is crucial, is yet a harder problem. In this paper I present the Variational Bayes framework, which provides a solution to these problems. This approach approximates full posterior distributions over model parameters and structures, as well as latent variables, in an analytical manner without resorting to sampling methods. Unlike in the Laplace approximation, these posteriors are generally non-Gaussian and no Hessian needs to be computed. The resulting algorithm generalizes the standard Expectation Maximization algorithm, and its convergence is guaranteed. I demonstrate that this algorithm can be applied to a large class of models in several domains, including unsupervised clustering and blind source separation.


Multi-class Generalized Binary Search for Active Inverse Reinforcement Learning

arXiv.org Artificial Intelligence

This paper addresses the problem of learning a task from demonstration. We adopt the framework of inverse reinforcement learning, where tasks are represented in the form of a reward function. Our contribution is a novel active learning algorithm that enables the learning agent to query the expert for more informative demonstrations, thus leading to more sample-efficient learning. For this novel algorithm (Generalized Binary Search for Inverse Reinforcement Learning, or GBS-IRL), we provide a theoretical bound on sample complexity and illustrate its applicability on several different tasks. To our knowledge, GBS-IRL is the first active IRL algorithm with provable sample complexity bounds. We also discuss our method in light of other existing methods in the literature and its general applicability in multi-class classification problems. Finally, motivated by recent work on learning from demonstration in robots, we also discuss how different forms of human feedback can be integrated in a transparent manner in our learning framework.


Contextual Weak Independence in Bayesian Networks

arXiv.org Artificial Intelligence

It is well-known that the notion of (strong) conditional independence (CI) is too restrictive to capture independencies that only hold in certain contexts. This kind of contextual independency, called context-strong independence (CSI), can be used to facilitate the acquisition, representation, and inference of probabilistic knowledge. In this paper, we suggest the use of contextual weak independence (CWI) in Bayesian networks. It should be emphasized that the notion of CWI is a more general form of contextual independence than CSI. Furthermore, if the contextual strong independence holds for all contexts, then the notion of CSI becomes strong CI. On the other hand, if the weak contextual independence holds for all contexts, then the notion of CWI becomes weak independence (WI) nwhich is a more general noncontextual independency than strong CI. More importantly, complete axiomatizations are studied for both the class of WI and the class of CI and WI together. Finally, the interesting property of WI being a necessary and sufficient condition for ensuring consistency in granular probabilistic networks is shown.


How to Elicit Many Probabilities

arXiv.org Artificial Intelligence

In building Bayesian belief networks, the elicitation of all probabilities required can be a major obstacle. We learned the extent of this often-cited observation in the construction of the probabilistic part of a complex influence diagram in the field of cancer treatment. Based upon our negative experiences with existing methods, we designed a new method for probability elicitation from domain experts. The method combines various ideas, among which are the ideas of transcribing probabilities and of using a scale with both numerical and verbal anchors for marking assessments. In the construction of the probabilistic part of our influence diagram, the method proved to allow for the elicitation of many probabilities in little time.