Goto

Collaborating Authors

 Bayesian Learning


Stretching the Effectiveness of MLE from Accuracy to Bias for Pairwise Comparisons

arXiv.org Machine Learning

A number of applications (e.g., AI bot tournaments, sports, peer grading, crowdsourcing) use pairwise comparison data and the Bradley-Terry-Luce (BTL) model to evaluate a given collection of items (e.g., bots, teams, students, search results). Past work has shown that under the BTL model, the widely-used maximum-likelihood estimator (MLE) is minimax-optimal in estimating the item parameters, in terms of the mean squared error. However, another important desideratum for designing estimators is fairness. In this work, we consider fairness modeled by the notion of bias in statistics. We show that the MLE incurs a suboptimal rate in terms of bias. We then propose a simple modification to the MLE, which "stretches" the bounding box of the maximum-likelihood optimizer by a small constant factor from the underlying ground truth domain. We show that this simple modification leads to an improved rate in bias, while maintaining minimax-optimality in the mean squared error. In this manner, our proposed class of estimators provably improves fairness represented by bias without loss in accuracy.


Likelihood-free approximate Gibbs sampling

arXiv.org Machine Learning

Likelihood-free methods refer to procedures that perform likelihood-based statistical inference, but without direct evaluation of the likelihood function. This is attractive when the likelihood function is computationally prohibitive to evaluate due to dataset size or model complexity, or when the likelihood function is only known through a data generation process. Some classes of likelihood-free methods include pseudo-marginal methods (Beaumont 2003; Andrieu and Roberts 2009), indirect inference (Gourieroux et al. 1993) and approximate Bayesian computation (Sisson et al. 2018a). In particular, approximate Bayesian computation (ABC) methods form an approximation to the computationally intractable posterior distribution by firstly sampling parameter vectors from the prior, and conditional on these, generating synthetic datasets under the model. The parameter vectors are then weighted by how well a vector of summary statistics of the synthetic datasets matches the same summary statistics of the observed data. ABC methods have seen extensive application and development over the past 15 years.


Self-Supervised Exploration via Disagreement

arXiv.org Artificial Intelligence

Efficient exploration is a long-standing problem in sensorimotor learning. Major advances have been demonstrated in noise-free, non-stochastic domains such as video games and simulation. However, most of these formulations either get stuck in environments with stochastic dynamics or are too inefficient to be scalable to real robotics setups. In this paper, we propose a formulation for exploration inspired by the work in active learning literature. Specifically, we train an ensemble of dynamics models and incentivize the agent to explore such that the disagreement of those ensembles is maximized. This allows the agent to learn skills by exploring in a self-supervised manner without any external reward. Notably, we further leverage the disagreement objective to optimize the agent's policy in a differentiable manner, without using reinforcement learning, which results in a sample-efficient exploration. We demonstrate the efficacy of this formulation across a variety of benchmark environments including stochastic-Atari, Mujoco and Unity. Finally, we implement our differentiable exploration on a real robot which learns to interact with objects completely from scratch. Project videos and code are at https://pathak22.github.io/exploration-by-disagreement/


Radial Prediction Layer

arXiv.org Machine Learning

For a broad variety of critical applications, it is essential to know how confident a classification prediction is. In this paper, we discuss the drawbacks of softmax to calculate class probabilities and to handle uncertainty in Bayesian neural networks. We introduce a new kind of prediction layer called radial prediction layer (RPL) to overcome these issues. In contrast to the softmax classification, RPL is based on the open-world assumption. Therefore, the class prediction probabilities are much more meaningful to assess the uncertainty concerning the novelty of the input. We show that neural networks with RPLs can be learned in the same way as neural networks using softmax. On a 2D toy data set (spiral data), we demonstrate the fundamental principles and advantages. On the real-world ImageNet data set, we show that the open-world properties are beneficially fulfilled. Additionally, we show that RPLs are less sensible to adversarial attacks on the MNIST data set. Due to its features, we expect RPL to be beneficial in a broad variety of applications, especially in critical environments, such as medicine or autonomous driving.


Bayesian Tensor Filtering: Smooth, Locally-Adaptive Factorization of Functional Matrices

arXiv.org Machine Learning

We consider the problem of functional matrix factorization, finding low-dimensional structure in a matrix where every entry is a noisy function evaluated at a set of discrete points. Such problems arise frequently in drug discovery, where biological samples form the rows, candidate drugs form the columns, and entries contain the dose-response curve of a sample treated at different concentrations of a drug. We propose Bayesian Tensor Filtering (BTF), a hierarchical Bayesian model of matrices of functions. BTF captures the smoothness in each individual function while also being locally adaptive to sharp discontinuities. The BTF model is agnostic to the likelihood of the underlying observations, making it flexible enough to handle many different kinds of data. We derive efficient Gibbs samplers for three classes of likelihoods: (i) Gaussian, for which updates are fully conjugate; (ii) Binomial and related likelihoods, for which updates are conditionally conjugate through P{\'o}lya--Gamma augmentation; and (iii) Black-box likelihoods, for which updates are non-conjugate but admit an analytic truncated elliptical slice sampling routine. We compare BTF against a state-of-the-art method for dynamic Poisson matrix factorization, showing BTF better reconstructs held out data in synthetic experiments. Finally, we build a dose-response model around BTF and show on real data from a multi-sample, multi-drug cancer study that BTF outperforms the current standard approach in biology. Code for BTF is available at https://github.com/tansey/functionalmf.


Errors-in-variables Modeling of Personalized Treatment-Response Trajectories

arXiv.org Machine Learning

Estimating the effect of a treatment on a given outcome, conditioned on a vector of covariates, is central in many applications. However, learning the impact of a treatment on a continuous temporal response, when the covariates suffer extensively from measurement error and even the timing of the treatments is uncertain, has not been addressed. We introduce a novel data-driven method that can estimate treatment-response trajectories in this challenging scenario. We model personalized treatment-response curves as a combination of parametric response functions, hierarchically sharing information across individuals, and a sparse Gaussian process for the baseline trend. Importantly, our model considers measurement error not only in treatment covariates, but also in treatment times, a problem which arises in practice for example when treatment information is based on self-reporting. In a challenging and timely problem of estimating the impact of diet on continuous blood glucose measurements, our model leads to significant improvements in estimation accuracy and prediction.


Stochastic Neural Network with Kronecker Flow

arXiv.org Machine Learning

Recent advances in variational inference enable the modelling of highly structured joint distributions, but are limited in their capacity to scale to the high-dimensional setting of stochastic neural networks. This limitation motivates a need for scalable parameterizations of the noise generation process, in a manner that adequately captures the dependencies among the various parameters. In this work, we address this need and present the Kronecker Flow, a generalization of the Kronecker product to invertible mappings designed for stochastic neural networks. We apply our method to variational Bayesian neural networks on predictive tasks, PAC-Bayes generalization bound estimation, and approximate Thompson sampling in contextual bandits. In all setups, our methods prove to be competitive with existing methods and better than the baselines.


Bayesian experimental design using regularized determinantal point processes

arXiv.org Machine Learning

In experimental design, we are given $n$ vectors in $d$ dimensions, and our goal is to select $k\ll n$ of them to perform expensive measurements, e.g., to obtain labels/responses, for a linear regression task. Many statistical criteria have been proposed for choosing the optimal design, with popular choices including A- and D-optimality. If prior knowledge is given, typically in the form of a $d\times d$ precision matrix $\mathbf A$, then all of the criteria can be extended to incorporate that information via a Bayesian framework. In this paper, we demonstrate a new fundamental connection between Bayesian experimental design and determinantal point processes, the latter being widely used for sampling diverse subsets of data. We use this connection to develop new efficient algorithms for finding $(1+\epsilon)$-approximations of optimal designs under four optimality criteria: A, C, D and V. Our algorithms can achieve this when the desired subset size $k$ is $\Omega(\frac{d_{\mathbf A}}{\epsilon} + \frac{\log 1/\epsilon}{\epsilon^2})$, where $d_{\mathbf A}\leq d$ is the $\mathbf A$-effective dimension, which can often be much smaller than $d$. Our results offer direct improvements over a number of prior works, for both Bayesian and classical experimental design, in terms of algorithm efficiency, approximation quality, and range of applicable criteria.


Goodness-of-fit Test for Latent Block Models

arXiv.org Machine Learning

Latent Block Models are used for probabilistic biclustering, which is shown to be an effective method for analyzing various relational data sets. However, there has been no statistical test method for determining the row and column cluster numbers of Latent Block Models. Recent studies have constructed statistical-test-based methods for Stochastic Block Models, in which we assume that the observed matrix is a square symmetric matrix and that the cluster assignments are the same for rows and columns. In this paper, we develop a goodness-of-fit test for Latent Block Models, which tests whether an observed data matrix fits a given set of row and column cluster numbers, or it consists of more clusters in at least one direction of row and column. To construct the test method, we use a result from random matrix theory for a sample covariance matrix. We show experimentally the effectiveness of our proposed method, by showing the asymptotic behavior of the test statistic and the test accuracy.


Learning Fair Naive Bayes Classifiers by Discovering and Eliminating Discrimination Patterns

arXiv.org Artificial Intelligence

As machine learning is increasingly used to make real-world decisions, recent research efforts aim to define and ensure fairness in algorithmic decision making. Existing methods often assume a fixed set of observable features to define individuals, but lack a discussion of certain features not being observed at test time. In this paper, we study fairness of naive Bayes classifiers, which allow partial observations. In particular, we introduce the notion of a discrimination pattern, which refers to an individual receiving different classifications depending on whether some sensitive attributes were observed. Then a model is considered fair if it has no such pattern. We propose an algorithm to discover and mine for discrimination patterns in a naive Bayes classifier, and show how to learn maximum-likelihood parameters subject to these fairness constraints. Our approach iteratively discovers and eliminates discrimination patterns until a fair model is learned. An empirical evaluation on three real-world datasets demonstrates that we can remove exponentially many discrimination patterns by only adding a small fraction of them as constraints.