Learning Graphical Models
Improved learning of Bayesian networks
The search space of Bayesian Network structures is usually defined as Acyclic Directed Graphs (DAGs) and the search is done by local transformations of DAGs. But the space of Bayesian Networks is ordered by DAG Markov model inclusion and it is natural to consider that a good search policy should take this into account. First attempt to do this (Chickering 1996) was using equivalence classes of DAGs instead of DAGs itself. This approach produces better results but it is significantly slower. We present a compromise between these two approaches. It uses DAGs to search the space in such a way that the ordering by inclusion is taken into account. This is achieved by repetitive usage of local moves within the equivalence class of DAGs. We show that this new approach produces better results than the original DAGs approach without substantial change in time complexity. We present empirical results, within the framework of heuristic search and Markov Chain Monte Carlo, provided through the Alarm dataset.
On characterizing Inclusion of Bayesian Networks
Kocka, Tomas, Bouckaert, Remco R., Studeny, Milan
Every directed acyclic graph (DAG) over a finite non-empty set of variables (= nodes) N induces an independence model over N, which is a list of conditional independence statements over N.The inclusion problem is how to characterize (in graphical terms) whether all independence statements in the model induced by a DAG K are in the model induced by a second DAG L. Meek (1997) conjectured that this inclusion holds iff there exists a sequence of DAGs from L to K such that only certain 'legal' arrow reversal and 'legal' arrow adding operations are performed to get the next DAG in the sequence.In this paper we give several characterizations of inclusion of DAG models and verify Meek's conjecture in the case that the DAGs K and L differ in at most one adjacency. As a warming up a rigorous proof of well-known graphical characterizations of equivalence of DAGs, which is a highly related problem, is given.
Estimating Well-Performing Bayesian Networks using Bernoulli Mixtures
A novel method for estimating Bayesian network (BN) parameters from data is presented which provides improved performance on test data. Previous research has shown the value of representing conditional probability distributions (CPDs) via neural networks(Neal 1992), noisy-OR gates (Neal 1992, Diez 1993)and decision trees (Friedman and Goldszmidt 1996).The Bernoulli mixture network (BMN) explicitly represents the CPDs of discrete BN nodes as mixtures of local distributions,each having a different set of parents.This increases the space of possible structures which can be considered,enabling the CPDs to have finer-grained dependencies.The resulting estimation procedure induces a modelthat is better able to emulate the underlying interactions occurring in the data than conventional conditional Bernoulli network models.The results for artificially generated data indicate that overfitting is best reduced by restricting the complexity of candidate mixture substructures local to each node. Furthermore, mixtures of very simple substructures can perform almost as well as more complex ones.The BMN is also applied to data collected from an online adventure game with an application to keyhole plan recognition. The results show that the BMN-based model brings a dramatic improvement in performance over a conventional BN model.
A Bayesian Approach to Tackling Hard Computational Problems
Horvitz, Eric J., Ruan, Yongshao, Gomes, Carla P., Kautz, Henry, Selman, Bart, Chickering, David Maxwell
We are developing a general framework for using learned Bayesian models for decision-theoretic control of search and reasoningalgorithms. We illustrate the approach on the specific task of controlling both general and domain-specific solvers on a hard class of structured constraint satisfaction problems. A successful strategyfor reducing the high (and even infinite) variance in running time typically exhibited by backtracking search algorithms is to cut off and restart the search if a solution is not found within a certainamount of time. Previous work on restart strategies have employed fixed cut off values. We show how to create a dynamic cut off strategy by learning a Bayesian model that predicts the ultimate length of a trial based on observing the early behavior of the search algorithm. Furthermore, we describe the general conditions under which a dynamic restart strategy can outperform the theoretically optimal fixed strategy.
Enumerating Markov Equivalence Classes of Acyclic Digraph Models
Gillispie, Steven B., Perlman, Michael D.
Graphical Markov models determined by acyclic digraphs (ADGs), also called directed acyclic graphs (DAGs), are widely studied in statistics, computer science (as Bayesian networks), operations research (as influence diagrams), and many related fields. Because different ADGs may determine the same Markov equivalence class, it long has been of interest to determine the efficiency gained in model specification and search by working directly with Markov equivalence classes of ADGs rather than with ADGs themselves. A computer program was written to enumerate the equivalence classes of ADG models as specified by Pearl & Verma's equivalence criterion. The program counted equivalence classes for models up to and including 10 vertices. The ratio of number of classes to ADGs appears to approach an asymptote of about 0.267. Classes were analyzed according to number of edges and class size. By edges, the distribution of number of classes approaches a Gaussian shape. By class size, classes of size 1 are most common, with the proportions for larger sizes initially decreasing but then following a more irregular pattern. The maximum number of classes generated by any undirected graph was found to increase approximately factorially. The program also includes a new variation of orderly algorithm for generating undirected graphs.
Multivariate Information Bottleneck
Friedman, Nir, Mosenzon, Ori, Slonim, Noam, Tishby, Naftali
The Information bottleneck method is an unsupervised non-parametric data organization technique. Given a joint distribution P(A,B), this method constructs a new variable T that extracts partitions, or clusters, over the values of A that are informative about B. The information bottleneck has already been applied to document classification, gene expression, neural code, and spectral analysis. In this paper, we introduce a general principled framework for multivariate extensions of the information bottleneck method. This allows us to consider multiple systems of data partitions that are inter-related. Our approach utilizes Bayesian networks for specifying the systems of clusters and what information each captures. We show that this construction provides insight about bottleneck variations and enables us to characterize solutions of these variations. We also present a general framework for iterative algorithms for constructing solutions, and apply it to several examples.
Learning the Dimensionality of Hidden Variables
A serious problem in learning probabilistic models is the presence of hidden variables. These variables are not observed, yet interact with several of the observed variables. Detecting hidden variables poses two problems: determining the relations to other variables in the model and determining the number of states of the hidden variable. In this paper, we address the latter problem in the context of Bayesian networks. We describe an approach that utilizes a score-based agglomerative state-clustering. As we show, this approach allows us to efficiently evaluate models with a range of cardinalities for the hidden variable. We show how to extend this procedure to deal with multiple interacting hidden variables. We demonstrate the effectiveness of this approach by evaluating it on synthetic and real-life data. We show that our approach learns models with hidden variables that generalize better and have better structure than previous approaches.
Incorporating Expressive Graphical Models in Variational Approximations: Chain-Graphs and Hidden Variables
Global variational approximation methods in graphical models allow efficient approximate inference of complex posterior distributions by using a simpler model. The choice of the approximating model determines a tradeoff between the complexity of the approximation procedure and the quality of the approximation. In this paper, we consider variational approximations based on two classes of models that are richer than standard Bayesian networks, Markov networks or mixture models. As such, these classes allow to find better tradeoffs in the spectrum of approximations. The first class of models are chain graphs, which capture distributions that are partially directed. The second class of models are directed graphs (Bayesian networks) with additional latent variables. Both classes allow representation of multi-variable dependencies that cannot be easily represented within a Bayesian network.
Hybrid Processing of Beliefs and Constraints
Dechter, Rina, Larkin, David Ephraim
This paper explores algorithms for processing probabilistic and deterministic information when the former is represented as a belief network and the latter as a set of boolean clauses. The motivating tasks are 1. evaluating beliefs networks having a large number of deterministic relationships and2. evaluating probabilities of complex boolean querie over a belief network. We propose a parameterized family of variable elimination algorithms that exploit both types of information, and that allows varying levels of constraint propagation inferences. The complexity of the scheme is controlled by the induced-width of the graph {em augmented} by the dependencies introduced by the boolean constraints. Preliminary empirical evaluation demonstrate the effect of constraint propagation on probabilistic computation.
Using Bayesian Networks to Identify the Causal Effect of Speeding in Individual Vehicle/Pedestrian Collisions
On roads showing significant violations of posted speed limits, one measure of the safety effect of speeding is the difference between the road's actual accident count and the count that would have occurred if the posted speed limit had been strictly obeyed. An estimate of this accident reduction can be had by computing the probability that speeding was a necessary condition for each of set of accidents. This is an instance of assessing individual probabilities of causation, which is generally not possible absent prior knowledge of causal structure. For traffic accidents such prior knowledge is often available and this paper illustrates how, for a commonly occurring class of vehicle/pedestrian accidents, approaches to uncertainty and causal analyses appearing in the accident reconstruction literature can be unified using Bayesian networks. Measured skidmarks, pedestrian throw distances, and pedestrian injury severity are treated as evidence, and using the Gibbs Sampling routine BUGS, the posterior probability distribution over exogenous variables, such as the vehicle's initial speed, location, and driver reaction time, is computed. This posterior distribution is then used to compute the "probability of necessity" for speeding.