 Bayesian Learning


Entropy-based Pruning for Learning Bayesian Networks using BIC

arXiv.org Machine Learning

For decomposable score-based structure learning of Bayesian networks, existing approaches first compute a collection of candidate parent sets for each variable and then optimize over this collection by choosing one parent set for each variable without creating directed cycles while maximizing the total score. We target the task of constructing the collection of candidate parent sets when the score of choice is the Bayesian Information Criterion (BIC). We provide new non-trivial results that can be used to prune the search space of candidate parent sets of each node. We analyze how these new results relate to previous ideas in the literature both theoretically and empirically. We show in experiments with UCI data sets that gains can be significant. Since the new pruning rules are easy to implement and have low computational costs, they can be promptly integrated into all state-of-the-art methods for structure learning of Bayesian networks.
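As a rough illustration of the quantity being pruned over, the sketch below computes the BIC score of one candidate parent set on complete discrete data, using the common form "maximized log-likelihood minus (log N / 2) times the number of free parameters". The function name `bic_score`, the row-as-dict data layout, and the toy data are assumptions made for this example; the paper's pruning rules themselves are not reproduced here.

```python
import math
from collections import Counter


def bic_score(data, child, parents, arities):
    """BIC of `child` given `parents` on complete discrete data.

    `data` is a list of dicts mapping variable name -> observed state;
    `arities` maps variable name -> number of states (hypothetical layout).
    """
    n = len(data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    joint_counts = Counter(
        (tuple(row[p] for p in parents), row[child]) for row in data
    )
    # Maximized log-likelihood: sum over observed cells of N_ijk * log(N_ijk / N_ij).
    log_lik = sum(
        count * math.log(count / parent_counts[pa_config])
        for (pa_config, _), count in joint_counts.items()
    )
    # Free parameters: (|child| - 1) per parent configuration.
    n_params = (arities[child] - 1) * math.prod(arities[p] for p in parents)
    return log_lik - 0.5 * math.log(n) * n_params


# Toy data in which y is independent of x: adding x as a parent only adds penalty.
rows = [{"x": x, "y": y} for x in (0, 1) for y in (0, 1)] * 25
arities = {"x": 2, "y": 2}
score_empty = bic_score(rows, "y", [], arities)
score_with_x = bic_score(rows, "y", ["x"], arities)
```

On this toy data the two candidate sets have identical log-likelihood, so the empty parent set wins purely on the penalty term. That is the kind of comparison a pruning rule tries to settle without scoring every superset.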


MML is not consistent for Neyman-Scott

arXiv.org Machine Learning

Strict Minimum Message Length (SMML) is a statistical inference method widely cited (but only with informal arguments) as providing estimations that are consistent for general estimation problems. It is, however, almost invariably intractable to compute, for which reason only approximations of it (known as MML algorithms) are ever used in practice. We investigate the Neyman-Scott estimation problem, an oft-cited showcase for the consistency of MML, and show that even with a natural choice of prior, neither SMML nor its popular approximations are consistent for it, thereby providing a counterexample to the general claim. This is the first known explicit construction of an SMML solution for a natural, high-dimensional problem. We use the same novel construction methods to refute other claims regarding MML also appearing in the literature.
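For readers unfamiliar with the Neyman-Scott setup, the following sketch simulates it in its classical form: many groups, each with its own unknown mean and two observations, all sharing one variance. It shows only the well-known behaviour of the plain maximum likelihood estimator of the variance, which settles near half the true value. It is not the SMML construction from the paper; the function name, the uniform range for the group means, and the seed are assumptions for illustration.

```python
import random


def neyman_scott_mle_variance(n_groups, sigma=1.0, seed=0):
    """ML estimate of the shared variance with two observations per group."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_groups):
        mu = rng.uniform(-10.0, 10.0)  # nuisance mean, one per group (assumed range)
        x1 = rng.gauss(mu, sigma)
        x2 = rng.gauss(mu, sigma)
        group_mean = 0.5 * (x1 + x2)   # ML estimate of this group's mean
        total += (x1 - group_mean) ** 2 + (x2 - group_mean) ** 2
    # The MLE divides by the total number of observations, 2 * n_groups.
    return total / (2 * n_groups)


estimate = neyman_scott_mle_variance(20000)
```

With a true variance of 1, the estimate stays close to 0.5 however many groups are added, which is why the problem is a standard test case for consistency claims.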


Improving Output Uncertainty Estimation and Generalization in Deep Learning via Neural Network Gaussian Processes

arXiv.org Machine Learning

We propose a simple method that combines neural networks and Gaussian processes. The proposed method can estimate the uncertainty of outputs and flexibly adjust target functions where training data exist, which are advantages of Gaussian processes. The proposed method can also achieve high generalization performance for unseen input configurations, which is an advantage of neural networks. With the proposed method, neural networks are used for the mean functions of Gaussian processes. We present a scalable stochastic inference procedure, where sparse Gaussian processes are inferred by stochastic variational inference, and the parameters of neural networks and kernels are estimated by stochastic gradient descent methods, simultaneously. We use two real-world spatio-temporal data sets to demonstrate experimentally that the proposed method achieves better uncertainty estimation and generalization performance than neural networks and Gaussian processes.


On Adaptive Propensity Score Truncation in Causal Inference

arXiv.org Machine Learning

The positivity assumption, or the experimental treatment assignment (ETA) assumption, is important for identifiability in causal inference. Even if the positivity assumption holds, practical violations of this assumption may jeopardize the finite sample performance of the causal estimator. One of the consequences of practical violations of the positivity assumption is extreme values in the estimated propensity score (PS). A common practice to address this issue is truncating the PS estimate when constructing PS-based estimators. In this study, we propose a novel adaptive truncation method, Positivity-C-TMLE, based on the collaborative targeted maximum likelihood estimation (C-TMLE) methodology. We demonstrate the outstanding performance of our novel approach in a variety of simulations by comparing it with other commonly studied estimators. Results show that by adaptively truncating the estimated PS with a more targeted objective function, the Positivity-C-TMLE estimator achieves the best performance for both point estimation and confidence interval coverage among all estimators considered.
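The common practice mentioned in the abstract, truncating the estimated PS before it enters a PS-based estimator, can be sketched with a fixed cutoff and a plain inverse-probability-weighted (IPW) estimate of the average treatment effect. This is only the baseline being improved upon: the cutoff `c` is chosen by hand here, whereas Positivity-C-TMLE selects it adaptively, and that selection step is not shown. Function names and the toy numbers are assumptions.

```python
def truncate_ps(ps, c):
    """Clip each estimated propensity score into [c, 1 - c]."""
    return [min(max(p, c), 1.0 - c) for p in ps]


def ipw_ate(y, a, ps):
    """Inverse-probability-weighted estimate of the average treatment effect.

    y: outcomes, a: binary treatment indicators, ps: estimated P(A=1 | W).
    """
    n = len(y)
    treated = sum(ai * yi / pi for yi, ai, pi in zip(y, a, ps)) / n
    control = sum((1 - ai) * yi / (1.0 - pi) for yi, ai, pi in zip(y, a, ps)) / n
    return treated - control


# One treated unit with a near-zero estimated PS dominates the raw estimate.
y = [1.0, 1.0]
a = [1, 0]
ps = [0.001, 0.5]
raw_estimate = ipw_ate(y, a, ps)
truncated_estimate = ipw_ate(y, a, truncate_ps(ps, 0.05))
```

In this two-unit toy case the raw estimate is driven almost entirely by a single weight of 1000, while truncation at 0.05 caps that weight at 20. The trade-off is bias introduced by the cutoff, which is what motivates choosing it in a data-adaptive way.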


Improving Gibbs Sampler Scan Quality with DoGS

arXiv.org Machine Learning

The pairwise influence matrix of Dobrushin has long been used as an analytical tool to bound the rate of convergence of Gibbs sampling. In this work, we use Dobrushin influence as the basis of a practical tool to certify and efficiently improve the quality of a discrete Gibbs sampler. Our Dobrushin-optimized Gibbs samplers (DoGS) offer customized variable selection orders for a given sampling budget and variable subset of interest, explicit bounds on total variation distance to stationarity, and certifiable improvements over the standard systematic and uniform random scan Gibbs samplers. In our experiments with joint image segmentation and object recognition, Markov chain Monte Carlo maximum likelihood estimation, and Ising model inference, DoGS consistently deliver higher-quality inferences with significantly smaller sampling budgets than standard Gibbs samplers.
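To make the notion of a "scan" concrete, the sketch below runs a discrete Gibbs sampler on a small Ising model with the variable selection order passed in explicitly, and builds the two standard orders named in the abstract (systematic and uniform random). DoGS would supply a customized order in place of these; how that order is optimized is not shown. The function names, the zero-external-field Ising parameterization, and the four-node chain are assumptions for illustration.

```python
import math
import random


def systematic_scan(n_vars, budget):
    """Visit variables 0, 1, ..., n_vars - 1 cyclically for `budget` updates."""
    return [t % n_vars for t in range(budget)]


def random_scan(n_vars, budget, rng):
    """Pick the variable to update uniformly at random at each step."""
    return [rng.randrange(n_vars) for _ in range(budget)]


def gibbs_ising(neighbors, beta, scan, state, rng):
    """Run one Gibbs update per entry of `scan` on spins in {-1, +1}.

    neighbors: dict mapping node -> list of adjacent nodes; beta: coupling.
    """
    state = list(state)
    for i in scan:
        field = sum(state[j] for j in neighbors[i])
        # Conditional probability that spin i is +1 given its neighbours.
        p_plus = 1.0 / (1.0 + math.exp(-2.0 * beta * field))
        state[i] = 1 if rng.random() < p_plus else -1
    return state


chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
rng = random.Random(0)
sample_systematic = gibbs_ising(chain, 0.5, systematic_scan(4, 40), [1, 1, 1, 1], rng)
sample_random = gibbs_ising(chain, 0.5, random_scan(4, 40, rng), [1, 1, 1, 1], rng)
```

Keeping the scan as a plain list of indices means any selection order, including one tuned for a particular sampling budget or subset of variables, can be dropped in without touching the update rule.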


One-Shot Learning in Discriminative Neural Networks

arXiv.org Machine Learning

We consider the task of one-shot learning of visual categories, or more generally, learning to classify images with few examples of particular classes. The currently dominant image classification paradigm of supervised deep learning performs well only when data is abundant. In this paper we explore a Bayesian procedure for updating a pretrained convnet to classify a novel image category for which data is limited. We demonstrate that the approach is competitive with state-of-the-art methods whilst also being consistent with 'normal' methods for training deep networks on large data. Several approaches to one-shot learning have been noted as failing to beat a simple nearest-neighbour classifier [8]. Recent approaches to the problem have used relatively complicated architectures such as memory augmented neural networks [9, 10] or siamese networks [5]; or have been specialised for the task of one-shot learning [10]. Fei-Fei et al. [2] demonstrated one-shot learning as a Bayesian update to an image classification model with a prior based on categories learned with lots of data. Our work is a modern update of this work, applying this technique to deep convolutional networks.


Bayesian Nonlinear Support Vector Machines for Big Data

arXiv.org Machine Learning

We propose a fast inference method for Bayesian nonlinear support vector machines that leverages stochastic variational inference and inducing points. Our experiments show that the proposed method is faster than competing Bayesian approaches and scales easily to millions of data points. It provides additional features over frequentist competitors such as accurate predictive uncertainty estimates and automatic hyperparameter search.


The Next AI Milestone: Bridging the Semantic Gap – Intuition Machine – Medium

#artificialintelligence

John Launchbury of DARPA has an excellent video that I recommend everyone watch (viewing just the slides will give one a wrong impression of the content). Statistical Learning -- Where programmers create statistical models for specific problem domains and train them on big data. Contextual Adaptation -- Where systems construct contextual explanatory models for classes of real world phenomena. It's a bit of a simplified presentation because it lumps all of machine learning, Bayesian methods and Deep Learning into a single category. There are many more approaches to AI that don't fit within DARPA's 3 waves.


PAC-Bayes and Domain Adaptation

arXiv.org Machine Learning

We provide two main contributions in PAC-Bayesian theory for domain adaptation where the objective is to learn, from a source distribution, a well-performing majority vote on a different, but related, target distribution. Firstly, we improve on the approach we proposed in Germain et al. (2013), relying on a novel distribution pseudodistance based on a disagreement averaging, which allows us to derive a new, tighter domain adaptation bound for the target risk. While this bound stands in the spirit of common domain adaptation works, we derive a second bound (recently introduced in Germain et al., 2016) that brings a new perspective on domain adaptation by deriving an upper bound on the target risk where the distributions' divergence, expressed as a ratio, controls the tradeoff between a source error measure and the target voters' disagreement. We discuss and compare both results, from which we obtain PAC-Bayesian generalization bounds. Furthermore, from the PAC-Bayesian specialization to linear classifiers, we infer two learning algorithms, and we evaluate them on real data.


Cooperative Hierarchical Dirichlet Processes: Superposition vs. Maximization

arXiv.org Machine Learning

The cooperative hierarchical structure is a common and significant data structure observed in, or adopted by, many research areas, such as text mining (author-paper-word) and multi-label classification (label-instance-feature). Renowned Bayesian approaches for cooperative hierarchical structure modeling are mostly based on topic models. However, these approaches suffer from a serious issue in that the number of hidden topics/factors needs to be fixed in advance and an inappropriate number may lead to overfitting or underfitting. One elegant way to resolve this issue is Bayesian nonparametric learning, but existing work in this area still cannot be applied to cooperative hierarchical structure modeling. In this paper, we propose a cooperative hierarchical Dirichlet process (CHDP) to fill this gap. Each node in a cooperative hierarchical structure is assigned a Dirichlet process to model its weights on the infinite hidden factors/topics. Together with measure inheritance from the hierarchical Dirichlet process, two kinds of measure cooperation, i.e., superposition and maximization, are defined to capture the many-to-many relationships in the cooperative hierarchical structure. Furthermore, two constructive representations for CHDP, i.e., stick-breaking and international restaurant process, are designed to facilitate the model inference. Experiments on synthetic and real-world data with cooperative hierarchical structures demonstrate the properties and the ability of CHDP for cooperative hierarchical structure modeling and its potential for practical application scenarios.
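As background for the stick-breaking representation mentioned above, here is a minimal sketch of the standard truncated stick-breaking construction for a single Dirichlet process: each weight takes a Beta(1, alpha) fraction of whatever stick length remains. This is the ordinary single-DP construction, not the CHDP-specific one with superposition or maximization; the function name, the truncation level, and the seed are assumptions for illustration.

```python
import random


def stick_breaking_weights(alpha, n_atoms, seed=0):
    """Draw the first `n_atoms` stick-breaking weights of a DP with concentration alpha.

    Returns (weights, remaining), where `remaining` is the unbroken stick mass
    left for all atoms beyond the truncation level.
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0
    for _ in range(n_atoms):
        frac = rng.betavariate(1.0, alpha)  # fraction of the remaining stick
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    return weights, remaining


weights, leftover = stick_breaking_weights(alpha=2.0, n_atoms=50)
```

Because the weights and the leftover mass always sum to one, the number of topics with non-negligible weight is governed by `alpha` and the data rather than fixed in advance, which is the property the abstract relies on.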