Bayesian Learning
Deep Diffeomorphic Normalizing Flows
Salman, Hadi, Yadollahpour, Payman, Fletcher, Tom, Batmanghelich, Kayhan
The Normalizing Flow (NF) models a general probability density by estimating an invertible transformation applied on samples drawn from a known distribution. We introduce a new type of NF, called Deep Diffeomorphic Normalizing Flow (DDNF). A diffeomorphic flow is an invertible function where both the function and its inverse are smooth. We construct the flow using an ordinary differential equation (ODE) governed by a time-varying smooth vector field. We use a neural network to parametrize the smooth vector field and a recursive neural network (RNN) for approximating the solution of the ODE. Each cell in the RNN is a residual network implementing one Euler integration step. The architecture of our flow enables efficient likelihood evaluation, straightforward flow inversion, and results in highly flexible density estimation. An end-to-end trained DDNF achieves competitive results with state-of-the-art methods on a suite of density estimation and variational inference tasks. Finally, our method brings concepts from Riemannian geometry that, we believe, can open a new research direction for neural density estimation.
Deep convolutional Gaussian processes
Blomqvist, Kenneth, Kaski, Samuel, Heinonen, Markus
We propose deep convolutional Gaussian processes, a deep Gaussian process architecture with convolutional structure. The model is a principled Bayesian framework for detecting hierarchical combinations of local features for image classification. We demonstrate greatly improved image classification performance compared to current Gaussian process approaches on the MNIST and CIFAR-10 datasets. In particular, we improve CIFAR-10 accuracy by over 10 percentage points.
Bayes-CPACE: PAC Optimal Exploration in Continuous Space Bayes-Adaptive Markov Decision Processes
Lee, Gilwoo, Choudhury, Sanjiban, Hou, Brian, Srinivasa, Siddhartha S.
We present the first PAC optimal algorithm for Bayes-Adaptive Markov Decision Processes (BAMDPs) in continuous state and action spaces, to the best of our knowledge. The BAMDP framework elegantly addresses model uncertainty by incorporating Bayesian belief updates into long-term expected return. However, computing an exact optimal Bayesian policy is intractable. Our key insight is to compute a near-optimal value function by covering the continuous state-belief-action space with a finite set of representative samples and exploiting the Lipschitz continuity of the value function. We prove the near-optimality of our algorithm and analyze a number of schemes that boost the algorithm's efficiency. Finally, we empirically validate our approach on a number of discrete and continuous BAMDPs and show that the learned policy has consistently competitive performance against baseline approaches.
Text Classification of the Precursory Accelerating Seismicity Corpus: Inference on some Theoretical Trends in Earthquake Predictability Research from 1988 to 2018
Text analytics based on supervised machine learning classifiers has shown great promise in a multitude of domains, but has yet to be applied to Seismology. We test various standard models (Naive Bayes, k-Nearest Neighbors, Support Vector Machines, and Random Forests) on a seismological corpus of 100 articles related to the topic of precursory accelerating seismicity, spanning from 1988 to 2010. This corpus was labelled in Mignan (2011) with the precursor whether explained by critical processes (i.e., cascade triggering) or by other processes (such as signature of main fault loading). We investigate rather the classification process can be automatized to help analyze larger corpora in order to better understand trends in earthquake predictability research. We find that the Naive Bayes model performs best, in agreement with the machine learning literature for the case of small datasets, with cross-validation accuracies of 86% for binary classification. For a refined multiclass classification ('non-critical process' < 'agnostic' < 'critical process assumed' < 'critical process demonstrated'), we obtain up to 78% accuracy. Prediction on a dozen of articles published since 2011 shows however a weak generalization with a F1-score of 60%, only slightly better than a random classifier, which can be explained by a change of authorship and use of different terminologies. Yet, the model shows F1-scores greater than 80% for the two multiclass extremes ('non-critical process' versus 'critical process demonstrated') while it falls to random classifier results (around 25%) for papers labelled 'agnostic' or 'critical process assumed'. Those results are encouraging in view of the small size of the corpus and of the high degree of abstraction of the labelling. Domain knowledge engineering remains essential but can be made transparent by an investigation of Naive Bayes keyword posterior probabilities.
IMMIGRATE: A Margin-based Feature Selection Method with Interaction Terms
Zhao, Ruzhang, Hong, Pengyu, Liu, Jun S.
By balancing margin-quantity maximization and margin-quality maximization, the proposed IMMIGRATE algorithm considers both local and global information when using margin-based frameworks. We here derive a new mathematical interpretation of margin-based cost function by using the quadratic form distance (QFD) and applying both the large-margin and max-min entropy principles. We also design a new principle for classifying new samples and propose a Bayesian framework to iteratively minimize the cost function. We demonstrate the power of our new method by comparing it with 16 widely used classifiers (e.g. Support Vector Machine, k-nearest neighbors, RELIEF, etc.) including some classifiers that are capable of identifying interaction terms (e.g. SODA, hierNet, etc.) on synthetic dataset, five gene expression datasets, and twenty UCI machine learning datasets. Our method is able to outperform other methods in most cases.
On Theory for BART
Rockova, Veronika, Saha, Enakshi
Ensemble learning is a statistical paradigm built on the premise that many weak learners can perform exceptionally well when deployed collectively. The BART method of Chipman et al. (2010) is a prominent example of Bayesian ensemble learning, where each learner is a tree. Due to its impressive performance, BART has received a lot of attention from practitioners. Despite its wide popularity, however, theoretical studies of BART have begun emerging only very recently. Laying the foundations for the theoretical analysis of Bayesian forests, Rockova and van der Pas (2017) showed optimal posterior concentration under conditionally uniform tree priors. These priors deviate from the actual priors implemented in BART. Here, we study the exact BART prior and propose a simple modification so that it also enjoys optimality properties. To this end, we dive into branching process theory. We obtain tail bounds for the distribution of total progeny under heterogeneous Galton-Watson (GW) processes exploiting their connection to random walks. We conclude with a result stating the optimal rate of posterior convergence for BART.
Projective Inference in High-dimensional Problems: Prediction and Feature Selection
Piironen, Juho, Paasiniemi, Markus, Vehtari, Aki
This paper discusses predictive inference and feature selection for generalized linear models with scarce but high-dimensional data. We argue that in many cases one can benefit from a decision theoretically justified two-stage approach: first, construct a possibly non-sparse model that predicts well, and then find a minimal subset of features that characterize the predictions. The model built in the first step is referred to as the \emph{reference model} and the operation during the latter step as predictive \emph{projection}. The key characteristic of this approach is that it finds an excellent tradeoff between sparsity and predictive accuracy, and the gain comes from utilizing all available information including prior and that coming from the left out features. We review several methods that follow this principle and provide novel methodological contributions. We present a new projection technique that unifies two existing techniques and is both accurate and fast to compute. We also propose a way of evaluating the feature selection process using fast leave-one-out cross-validation that allows for easy and intuitive model size selection. Furthermore, we prove a theorem that helps to understand the conditions under which the projective approach could be beneficial. The benefits are illustrated via several simulated and real world examples.
Inhibited Softmax for Uncertainty Estimation in Neural Networks
Możejko, Marcin, Susik, Mateusz, Karczewski, Rafał
We present a new method for uncertainty estimation and out-of-distribution detection in neural networks with softmax output. We extend softmax layer with an additional constant input. The corresponding additional output is able to represent the uncertainty of the network. The proposed method requires neither additional parameters nor multiple forward passes nor input preprocessing nor out-of-distribution datasets. We show that our method performs comparably to more computationally expensive methods and outperforms baselines on our experiments from image recognition and sentiment analysis domains. The applications of computational learning systems might cause intrusive effects if we assume that predictions are always as accurate as during the experimental phase. Examples include misclassified traffic signs (Evtimov et al., 2018) and an image tagger that classified two African Americans as gorillas (Curtis, 2015). This is often caused by overconfidence of models that has been observed in the case of deep neural networks (Guo et al., 2017). Such malfunctions can be prevented if we estimate correctly the uncertainty of the machine learning system.
A Bayesian model for sparse graphs with flexible degree distribution and overlapping community structure
Lee, Juho, James, Lancelot F., Choi, Seungjin, Caron, François
We consider a non-projective class of inhomogeneous random graph models with interpretable parameters and a number of interesting asymptotic properties. Using the results of Bollob\'as et al. [2007], we show that i) the class of models is sparse and ii) depending on the choice of the parameters, the model is either scale-free, with power-law exponent greater than 2, or with an asymptotic degree distribution which is power-law with exponential cut-off. We propose an extension of the model that can accommodate an overlapping community structure. Scalable posterior inference can be performed due to the specific choice of the link probability. We present experiments on five different real-world networks with up to 100,000 nodes and edges, showing that the model can provide a good fit to the degree distribution and recovers well the latent community structure.