Bayesian Inference
Context-Specific Independence in Bayesian Networks
Boutilier, Craig, Friedman, Nir, Goldszmidt, Moises, Koller, Daphne
Bayesian networks provide a language for qualitatively representing the conditional independence properties of a distribution. This allows a natural and compact representation of the distribution, eases knowledge acquisition, and supports effective inference algorithms. It is well-known, however, that there are certain independencies that we cannot capture qualitatively within the Bayesian network structure: independencies that hold only in certain contexts, i.e., given a specific assignment of values to certain variables. In this paper, we propose a formal notion of context-specific independence (CSI), based on regularities in the conditional probability tables (CPTs) at a node. We present a technique, analogous to (and based on) d-separation, for determining when such independence holds in a given network. We then focus on a particular qualitative representation scheme - tree-structured CPTs - for capturing CSI. We suggest ways in which this representation can be used to support effective inference algorithms. In particular, we present a structural decomposition of the resulting network which can improve the performance of clustering algorithms, and an alternative algorithm based on cutset conditioning.
An Alternative Markov Property for Chain Graphs
Andersson, Steen A., Madigan, David, Perlman, Michael D.
Graphical Markov models use graphs, either undirected, directed, or mixed, to represent possible dependences among statistical variables. Applications of undirected graphs (UDGs) include models for spatial dependence and image analysis, while acyclic directed graphs (ADGs), which are especially convenient for statistical analysis, arise in such fields as genetics and psychometrics and as models for expert systems and Bayesian belief networks. Lauritzen, Wermuth and Frydenberg (LWF) introduced a Markov property for chain graphs, which are mixed graphs that can be used to represent simultaneously both causal and associative dependencies and which include both UDGs and ADGs as special cases. In this paper an alternative Markov property (AMP) for chain graphs is introduced, which in some ways is a more direct extension of the ADG Markov property than is the LWF property for chain graph.
A Structurally and Temporally Extended Bayesian Belief Network Model: Definitions, Properties, and Modeling Techniques
Aliferis, Constantin F., Cooper, Gregory F.
We developed the language of Modifiable Temporal Belief Networks (MTBNs) as a structural and temporal extension of Bayesian Belief Networks (BNs) to facilitate normative temporal and causal modeling under uncertainty. In this paper we present definitions of the model, its components, and its fundamental properties. We also discuss how to represent various types of temporal knowledge, with an emphasis on hybrid temporal-explicit time modeling, dynamic structures, avoiding causal temporal inconsistencies, and dealing with models that involve simultaneously actions (decisions) and causal and non-causal associations. We examine the relationships among BNs, Modifiable Belief Networks, and MTBNs with a single temporal granularity, and suggest areas of application suitable to each one of them.
Inference Using Message Propagation and Topology Transformation in Vector Gaussian Continuous Networks
Alag, Satnam, Agogino, Alice M.
We extend Gaussian networks - directed acyclic graphs that encode probabilistic relationships between variables - to its vector form. Vector Gaussian continuous networks consist of composite nodes representing multivariates, that take continuous values. These vector or composite nodes can represent correlations between parents, as opposed to conventional univariate nodes. We derive rules for inference in these networks based on two methods: message propagation and topology transformation. These two approaches lead to the development of algorithms, that can be implemented in either a centralized or a decentralized manner. The domain of application of these networks are monitoring and estimation problems. This new representation along with the rules for inference developed here can be used to derive current Bayesian algorithms such as the Kalman filter, and provide a rich foundation to develop new algorithms. We illustrate this process by deriving the decentralized form of the Kalman filter. This work unifies concepts from artificial intelligence and modern control theory.
Computational Complexity Reduction for BN2O Networks Using Similarity of States
Kozlov, Alexander V., Singh, Jaswinder Pal
Although probabilistic inference in a general Bayesian belief network is an NP-hard problem, computation time for inference can be reduced in most practical cases by exploiting domain knowledge and by making approximations in the knowledge representation. In this paper we introduce the property of similarity of states and a new method for approximate knowledge representation and inference which is based on this property. We define two or more states of a node to be similar when the ratio of their probabilities, the likelihood ratio, does not depend on the instantiations of the other nodes in the network. We show that the similarity of states exposes redundancies in the joint probability distribution which can be exploited to reduce the computation time of probabilistic inference in networks with multiple similar states, and that the computational complexity in the networks with exponentially many similar states might be polynomial. We demonstrate our ideas on the example of a BN2O network -- a two layer network often used in diagnostic problems -- by reducing it to a very close network with multiple similar states. We show that the answers to practical queries converge very fast to the answers obtained with the original network. The maximum error is as low as 5% for models that require only 10% of the computation time needed by the original BN2O model.
The trace norm constrained matrix-variate Gaussian process for multitask bipartite ranking
Koyejo, Oluwasanmi, Lee, Cheng, Ghosh, Joydeep
We propose a novel hierarchical model for multitask bipartite ranking. The proposed approach combines a matrix-variate Gaussian process with a generative model for task-wise bipartite ranking. In addition, we employ a novel trace constrained variational inference approach to impose low rank structure on the posterior matrix-variate Gaussian process. The resulting posterior covariance function is derived in closed form, and the posterior mean function is the solution to a matrix-variate regression with a novel spectral elastic net regularizer. Further, we show that variational inference for the trace constrained matrix-variate Gaussian process combined with maximum likelihood parameter estimation for the bipartite ranking model is jointly convex. Our motivating application is the prioritization of candidate disease genes. The goal of this task is to aid the identification of unobserved associations between human genes and diseases using a small set of observed associations as well as kernels induced by gene-gene interaction networks and disease ontologies. Our experimental results illustrate the performance of the proposed model on real world datasets. Moreover, we find that the resulting low rank solution improves the computational scalability of training and testing as compared to baseline models.
Efficiency for Regularization Parameter Selection in Penalized Likelihood Estimation of Misspecified Models
Flynn, Cheryl J., Hurvich, Clifford M., Simonoff, Jeffrey S.
It has been shown that AIC-type criteria are asymptotically efficient selectors of the tuning parameter in non-concave penalized regression methods under the assumption that the population variance is known or that a consistent estimator is available. We relax this assumption to prove that AIC itself is asymptotically efficient and we study its performance in finite samples. In classical regression, it is known that AIC tends to select overly complex models when the dimension of the maximum candidate model is large relative to the sample size. Simulation studies suggest that AIC suffers from the same shortcomings when used in penalized regression. We therefore propose the use of the classical corrected AIC (AICc) as an alternative and prove that it maintains the desired asymptotic properties. To broaden our results, we further prove the efficiency of AIC for penalized likelihood methods in the context of generalized linear models with no dispersion parameter. Similar results exist in the literature but only for a restricted set of candidate models. By employing results from the classical literature on maximum-likelihood estimation in misspecified models, we are able to establish this result for a general set of candidate models. We use simulations to assess the performance of AIC and AICc, as well as that of other selectors, in finite samples for both SCAD-penalized and Lasso regressions and a real data example is considered.
Support and Plausibility Degrees in Generalized Functional Models
By discussing several examples, the theory of generalized functional models is shown to be very natural for modeling some situations of reasoning under uncertainty. A generalized functional model is a pair (f, P) where f is a function describing the interactions between a parameter variable, an observation variable and a random source, and P is a probability distribution for the random source. Unlike traditional functional models, generalized functional models do not require that there is only one value of the parameter variable that is compatible with an observation and a realization of the random source. As a consequence, the results of the analysis of a generalized functional model are not expressed in terms of probability distributions but rather by support and plausibility functions. The analysis of a generalized functional model is very logical and is inspired from ideas already put forward by R.A. Fisher in his theory of fiducial probability.
Models and Selection Criteria for Regression and Classification
Heckerman, David, Meek, Christopher
When performing regression or classification, we are interested in the conditional probability distribution for an outcome or class variable Y given a set of explanatoryor input variables X. We consider Bayesian models for this task. In particular, we examine a special class of models, which we call Bayesian regression/classification (BRC) models, that can be factored into independent conditional (y|x) and input (x) models. These models are convenient, because the conditional model (the portion of the full model that we care about) can be analyzed by itself. We examine the practice of transforming arbitrary Bayesian models to BRC models, and argue that this practice is often inappropriate because it ignores prior knowledge that may be important for learning. In addition, we examine Bayesian methods for learning models from data. We discuss two criteria for Bayesian model selection that are appropriate for repression/classification: one described by Spiegelhalter et al. (1993), and another by Buntine (1993). We contrast these two criteria using the prequential framework of Dawid (1984), and give sufficient conditions under which the criteria agree.
Update Rules for Parameter Estimation in Bayesian Networks
Bauer, Eric, Koller, Daphne, Singer, Yoram
This paper re-examines the problem of parameter estimation in Bayesian networks with missing values and hidden variables from the perspective of recent work in on-line learning [Kivinen & Warmuth, 1994]. We provide a unified framework for parameter estimation that encompasses both on-line learning, where the model is continuously adapted to new data cases as they arrive, and the more traditional batch learning, where a pre-accumulated set of samples is used in a one-time model selection process. In the batch case, our framework encompasses both the gradient projection algorithm and the EM algorithm for Bayesian networks. The framework also leads to new on-line and batch parameter update schemes, including a parameterized version of EM. We provide both empirical and theoretical results indicating that parameterized EM allows faster convergence to the maximum likelihood parameters than does standard EM.