Goto

Collaborating Authors

 Bayesian Inference


MML is not consistent for Neyman-Scott

arXiv.org Machine Learning

Strict Minimum Message Length (SMML) is a statistical inference method widely cited (but only with informal arguments) as providing estimations that are consistent for general estimation problems. It is, however, almost invariably intractable to compute, for which reason only approximations of it (known as MML algorithms) are ever used in practice. We investigate the Neyman-Scott estimation problem, an oft-cited showcase for the consistency of MML, and show that even with a natural choice of prior, neither SMML nor its popular approximations are consistent for it, thereby providing a counterexample to the general claim. This is the first known explicit construction of an SMML solution for a natural, high-dimensional problem. We use the same novel construction methods to refute other claims regarding MML also appearing in the literature.


Improving Output Uncertainty Estimation and Generalization in Deep Learning via Neural Network Gaussian Processes

arXiv.org Machine Learning

We propose a simple method that combines neural networks and Gaussian processes. The proposed method can estimate the uncertainty of outputs and flexibly adjust target functions where training data exist, which are advantages of Gaussian processes. The proposed method can also achieve high generalization performance for unseen input configurations, which is an advantage of neural networks. With the proposed method, neural networks are used for the mean functions of Gaussian processes. We present a scalable stochastic inference procedure, where sparse Gaussian processes are inferred by stochastic variational inference, and the parameters of neural networks and kernels are estimated by stochastic gradient descent methods, simultaneously. We use two real-world spatio-temporal data sets to demonstrate experimentally that the proposed method achieves better uncertainty estimation and generalization performance than neural networks and Gaussian processes.


On Adaptive Propensity Score Truncation in Causal Inference

arXiv.org Machine Learning

The positivity assumption, or the experimental treatment assignment (ETA) assumption, is important for identifiability in causal inference. Even if the positivity assumption holds, practical violations of this assumption may jeopardize the finite sample performance of the causal estimator. One of the consequences of practical violations of the positivity assumption is extreme values in the estimated propensity score (PS). A common practice to address this issue is truncating the PS estimate when constructing PS-based estimators. In this study, we propose a novel adaptive truncation method, Positivity-C-TMLE, based on the collaborative targeted maximum likelihood estimation (C-TMLE) methodology. We demonstrate the outstanding performance of our novel approach in a variety of simulations by comparing it with other commonly studied estimators. Results show that by adaptively truncating the estimated PS with a more targeted objective function, the Positivity-C-TMLE estimator achieves the best performance for both point estimation and confidence interval coverage among all estimators considered.


Improving Gibbs Sampler Scan Quality with DoGS

arXiv.org Machine Learning

The pairwise influence matrix of Dobrushin has long been used as an analytical tool to bound the rate of convergence of Gibbs sampling. In this work, we use Dobrushin influence as the basis of a practical tool to certify and efficiently improve the quality of a discrete Gibbs sampler. Our Dobrushin-optimized Gibbs samplers (DoGS) offer customized variable selection orders for a given sampling budget and variable subset of interest, explicit bounds on total variation distance to stationarity, and certifiable improvements over the standard systematic and uniform random scan Gibbs samplers. In our experiments with joint image segmentation and object recognition, Markov chain Monte Carlo maximum likelihood estimation, and Ising model inference, DoGS consistently deliver higher-quality inferences with significantly smaller sampling budgets than standard Gibbs samplers.


One-Shot Learning in Discriminative Neural Networks

arXiv.org Machine Learning

We consider the task of one-shot learning of visual categories, or more generally, learning to classify images with few examples of particular classes. The currently dominant image classification paradigm of supervised deep learning performs well only when data is abundant. In this paper we explore a Bayesian procedure for updating a pretrained convnet to classify a novel image category for which data is limited. We demonstrate that the approach is competitive with state-of-the-art methods whilst also being consistent with'normal' methods for training deep networks on large data. Several approaches to one-shot learning have been noted as failing to beat a simple nearest-neighbour classifier [8]. Recent approaches of the problem have used relatively complicated architectures such as memory augmented neural networks [9, 10] or siamese networks [5]; or have been specialised for the task of one-shot learning [10]. Fei-Fei et al. [2] demonstrated one-shot learning as a Bayesian update to an image classification model with a prior based on categories learned with lots of data. Our work is an modern update of this work, applying this technique to deep convolutional networks.


Bayesian Nonlinear Support Vector Machines for Big Data

arXiv.org Machine Learning

We propose a fast inference method for Bayesian nonlinear support vector machines that leverages stochastic variational inference and inducing points. Our experiments show that the proposed method is faster than competing Bayesian approaches and scales easily to millions of data points. It provides additional features over frequentist competitors such as accurate predictive uncertainty estimates and automatic hyperparameter search.


Merging MCMC Subposteriors through Gaussian-Process Approximations

arXiv.org Machine Learning

Markov chain Monte Carlo (MCMC) algorithms have become powerful tools for Bayesian inference. However, they do not scale well to large-data problems. Divide-and-conquer strategies, which split the data into batches and, for each batch, run independent MCMC algorithms targeting the corresponding subposterior, can spread the computational burden across a number of separate workers. The challenge with such strategies is in recombining the subposteriors to approximate the full posterior. By creating a Gaussian-process approximation for each log-subposterior density we create a tractable approximation for the full posterior. This approximation is exploited through three methodologies: firstly a Hamiltonian Monte Carlo algorithm targeting the expectation of the posterior density provides a sample from an approximation to the posterior; secondly, evaluating the true posterior at the sampled points leads to an importance sampler that, asymptotically, targets the true posterior expectations; finally, an alternative importance sampler uses the full Gaussian-process distribution of the approximation to the log-posterior density to re-weight any initial sample and provide both an estimate of the posterior expectation and a measure of the uncertainty in it.


PAC-Bayes and Domain Adaptation

arXiv.org Machine Learning

We provide two main contributions in PAC-Bayesian theory for domain adaptation where the objective is to learn, from a source distribution, a well-performing majority vote on a different, but related, target distribution. Firstly, we propose an improvement of the previous approach we proposed in Germain et al. (2013), which relies on a novel distribution pseudodistance based on a disagreement averaging, allowing us to derive a new tighter domain adaptation bound for the target risk. While this bound stands in the spirit of common domain adaptation works, we derive a second bound (recently introduced in Germain et al., 2016) that brings a new perspective on domain adaptation by deriving an upper bound on the target risk where the distributions' divergence--expressed as a ratio-- controls the tradeoff between a source error measure and the target voters' disagreement. We discuss and compare both results, from which we obtain PAC-Bayesian generalization bounds. Furthermore, from the PAC-Bayesian specialization to linear classifiers, we infer two learning algorithms, and we evaluate them on real data.


Cooperative Hierarchical Dirichlet Processes: Superposition vs. Maximization

arXiv.org Machine Learning

The cooperative hierarchical structure is a common and significant data structure observed in, or adopted by, many research areas, such as: text mining (author-paper-word) and multi-label classification (label-instance-feature). Renowned Bayesian approaches for cooperative hierarchical structure modeling are mostly based on topic models. However, these approaches suffer from a serious issue in that the number of hidden topics/factors needs to be fixed in advance and an inappropriate number may lead to overfitting or underfitting. One elegant way to resolve this issue is Bayesian nonparametric learning, but existing work in this area still cannot be applied to cooperative hierarchical structure modeling. In this paper, we propose a cooperative hierarchical Dirichlet process (CHDP) to fill this gap. Each node in a cooperative hierarchical structure is assigned a Dirichlet process to model its weights on the infinite hidden factors/topics. Together with measure inheritance from hierarchical Dirichlet process, two kinds of measure cooperation, i.e., superposition and maximization, are defined to capture the many-to-many relationships in the cooperative hierarchical structure. Furthermore, two constructive representations for CHDP, i.e., stick-breaking and international restaurant process, are designed to facilitate the model inference. Experiments on synthetic and real-world data with cooperative hierarchical structures demonstrate the properties and the ability of CHDP for cooperative hierarchical structure modeling and its potential for practical application scenarios.


Efficient Online Learning for Optimizing Value of Information: Theory and Application to Interactive Troubleshooting

arXiv.org Artificial Intelligence

We consider the optimal value of information (VoI) problem, where the goal is to sequentially select a set of tests with a minimal cost, so that one can efficiently make the best decision based on the observed outcomes. Existing algorithms are either heuristics with no guarantees, or scale poorly (with exponential run time in terms of the number of available tests). Moreover, these methods assume a known distribution over the test outcomes, which is often not the case in practice. We propose an efficient sampling-based online learning framework to address the above issues. First, assuming the distribution over hypotheses is known, we propose a dynamic hypothesis enumeration strategy, which allows efficient information gathering with strong theoretical guarantees. We show that with sufficient amount of samples, one can identify a near-optimal decision with high probability. Second, when the parameters of the hypotheses distribution are unknown, we propose an algorithm which learns the parameters progressively via posterior sampling in an online fashion. We further establish a rigorous bound on the expected regret. We demonstrate the effectiveness of our approach on a real-world interactive troubleshooting application and show that one can efficiently make high-quality decisions with low cost.