Bayesian Inference
Reduction of Maximum Entropy Models to Hidden Markov Models
We show that maximum entropy (maxent) models can be modeled with certain kinds of HMMs, allowing us to construct maxent models with hidden variables, hidden state sequences, or other characteristics. The models can be trained using the forward-backward algorithm. While the results are primarily of theoretical interest, unifying apparently unrelated concepts, we also give experimental results for a maxent model with a hidden variable on a word disambiguation task; the model outperforms standard techniques.
Iterative Join-Graph Propagation
Dechter, Rina, Kask, Kalev, Mateescu, Robert
The paper presents an iterative version of join-tree clustering that applies the message passing of join-tree clustering algorithm to join-graphs rather than to join-trees, iteratively. It is inspired by the success of Pearl's belief propagation algorithm as an iterative approximation scheme on one hand, and by a recently introduced mini-clustering i. success as an anytime approximation method, on the other. The proposed Iterative Join-graph Propagation IJGP belongs to the class of generalized belief propagation methods, recently proposed using analogy with algorithms in statistical physics. Empirical evaluation of this approach on a number of problem classes demonstrates that even the most time-efficient variant is almost always superior to IBP and MC i, and is sometimes more accurate by as much as several orders of magnitude.
Interpolating Conditional Density Trees
Joint distributions over many variables are frequently modeled by decomposing them into products of simpler, lower-dimensional conditional distributions, such as in sparsely connected Bayesian networks. However, automatically learning such models can be very computationally expensive when there are many datapoints and many continuous variables with complex nonlinear relationships, particularly when no good ways of decomposing the joint distribution are known a priori. In such situations, previous research has generally focused on the use of discretization techniques in which each continuous variable has a single discretization that is used throughout the entire network. \ In this paper, we present and compare a wide variety of tree-based algorithms for learning and evaluating conditional density estimates over continuous variables. These trees can be thought of as discretizations that vary according to the particular interactions being modeled; however, the density within a given leaf of the tree need not be assumed constant, and we show that such nonuniform leaf densities lead to more accurate density estimation. We have developed Bayesian network structure-learning algorithms that employ these tree-based conditional density representations, and we show that they can be used to practically learn complex joint probability models over dozens of continuous variables from thousands of datapoints. We focus on finding models that are simultaneously accurate, fast to learn, and fast to evaluate once they are learned.
Convex Relaxations for Learning Bounded Treewidth Decomposable Graphs
Kumar, K. S. Sesh, Bach, Francis
We consider the problem of learning the structure of undirected graphical models with bounded treewidth, within the maximum likelihood framework. This is an NP-hard problem and most approaches consider local search techniques. In this paper, we pose it as a combinatorial optimization problem, which is then relaxed to a convex optimization problem that involves searching over the forest and hyperforest polytopes with special structures, independently. A supergradient method is used to solve the dual problem, with a run-time complexity of $O(k^3 n^{k+2} \log n)$ for each iteration, where $n$ is the number of variables and $k$ is a bound on the treewidth. We compare our approach to state-of-the-art methods on synthetic datasets and classical benchmarks, showing the gains of the novel convex approach.
PAC-Bayesian Learning and Domain Adaptation
Germain, Pascal, Habrard, Amaury, Laviolette, Franรงois, Morvant, Emilie
In machine learning, Domain Adaptation (DA) arises when the distribution gen- erating the test (target) data differs from the one generating the learning (source) data. It is well known that DA is an hard task even under strong assumptions, among which the covariate-shift where the source and target distributions diverge only in their marginals, i.e. they have the same labeling function. Another popular approach is to consider an hypothesis class that moves closer the two distributions while implying a low-error for both tasks. This is a VC-dim approach that restricts the complexity of an hypothesis class in order to get good generalization. Instead, we propose a PAC-Bayesian approach that seeks for suitable weights to be given to each hypothesis in order to build a majority vote. We prove a new DA bound in the PAC-Bayesian context. This leads us to design the first DA-PAC-Bayesian algorithm based on the minimization of the proposed bound. Doing so, we seek for a \rho-weighted majority vote that takes into account a trade-off between three quantities. The first two quantities being, as usual in the PAC-Bayesian approach, (a) the complexity of the majority vote (measured by a Kullback-Leibler divergence) and (b) its empirical risk (measured by the \rho-average errors on the source sample). The third quantity is (c) the capacity of the majority vote to distinguish some structural difference between the source and target samples.
Simulation-based optimal Bayesian experimental design for nonlinear systems
Huan, Xun, Marzouk, Youssef M.
The optimal selection of experimental conditions is essential to maximizing the value of data for inference and prediction, particularly in situations where experiments are time-consuming and expensive to conduct. We propose a general mathematical framework and an algorithmic approach for optimal experimental design with nonlinear simulation-based models; in particular, we focus on finding sets of experiments that provide the most information about targeted sets of parameters. Our framework employs a Bayesian statistical setting, which provides a foundation for inference from noisy, indirect, and incomplete data, and a natural mechanism for incorporating heterogeneous sources of information. An objective function is constructed from information theoretic measures, reflecting expected information gain from proposed combinations of experiments. Polynomial chaos approximations and a two-stage Monte Carlo sampling method are used to evaluate the expected information gain. Stochastic approximation algorithms are then used to make optimization feasible in computationally intensive and high-dimensional settings. These algorithms are demonstrated on model problems and on nonlinear parameter estimation problems arising in detailed combustion kinetics.
Bayesian learning of noisy Markov decision processes
Singh, Sumeetpal S., Chopin, Nicolas, Whiteley, Nick
We consider the inverse reinforcement learning problem, that is, the problem of learning from, and then predicting or mimicking a controller based on state/action data. We propose a statistical model for such data, derived from the structure of a Markov decision process. Adopting a Bayesian approach to inference, we show how latent variables of the model can be estimated, and how predictions about actions can be made, in a unified framework. A new Markov chain Monte Carlo (MCMC) sampler is devised for simulation from the posterior distribution. This step includes a parameter expansion step, which is shown to be essential for good convergence properties of the MCMC sampler. As an illustration, the method is applied to learning a human controller.
A Traveling Salesman Learns Bayesian Networks
Sahai, Tuhin, Klus, Stefan, Dellnitz, Michael
Structure learning of Bayesian networks is an important problem that arises in numerous machine learning applications. In this work, we present a novel approach for learning the structure of Bayesian networks using the solution of an appropriately constructed traveling salesman problem. In our approach, one computes an optimal ordering (partially ordered set) of random variables using methods for the traveling salesman problem. This ordering significantly reduces the search space for the subsequent greedy optimization that computes the final structure of the Bayesian network. We demonstrate our approach of learning Bayesian networks on real world census and weather datasets. In both cases, we demonstrate that the approach very accurately captures dependencies between random variables. We check the accuracy of the predictions based on independent studies in both application domains.
A survey of non-exchangeable priors for Bayesian nonparametric models
Foti, Nicholas J., Williamson, Sinead
There has recently been a spate of papers in the statistics and machine learning literature developing dependent stochastic processes and using them as priors in Bayesian nonparametric models. In this paper, we aim to provide a representative snapshot of the currently available models, to elucidate links between these models, and to provide an orienting view of the modern constructions of these processes. Traditional nonparametric priors such as the Dirichlet process [DP, 2], Chinese restaurant process [CRP, 3], Pitman-Yor process [4] and the Indian buffet process [IBP, 5] assume that our observations are exchangeable. Under the assumption of exchangeability the order of the data points does not change the probability distribution. Exchangeability is not a valid assumption for all data.
A unifying representation for a class of dependent random measures
Foti, Nicholas J., Futoma, Joseph D., Rockmore, Daniel N., Williamson, Sinead
We present a general construction for dependent random measures based on thinning Poisson processes on an augmented space. The framework is not restricted to dependent versions of a specific nonparametric model, but can be applied to all models that can be represented using completely random measures. Several existing dependent random measures can be seen as specific cases of this framework. Interesting properties of the resulting measures are derived and the efficacy of the framework is demonstrated by constructing a covariate-dependent latent feature model and topic model that obtain superior predictive performance.