Goto

Collaborating Authors

 Directed Networks


Deep Exploration via Randomized Value Functions

arXiv.org Machine Learning

We study the use of randomized value functions to guide deep exploration in reinforcement learning. This offers an elegant means for synthesizing statistically and computationally efficient exploration with common practical approaches to value function learning. We present several reinforcement learning algorithms that leverage randomized value functions and demonstrate their efficacy through computational studies. We also prove a regret bound that establishes statistical efficiency with a tabular representation.


LogitBoost autoregressive networks

arXiv.org Machine Learning

Multivariate binary distributions can be decomposed into products of univariate conditional distributions. Recently popular approaches have modeled these conditionals through neural networks with sophisticated weight-sharing structures. It is shown that state-of-the-art performance on several standard benchmark datasets can actually be achieved by training separate probability estimators for each dimension. In that case, model training can be trivially parallelized over data dimensions. On the other hand, complexity control has to be performed for each learned conditional distribution. Three possible methods are considered and experimentally compared. The estimator that is employed for each conditional is LogitBoost. Similarities and differences between the proposed approach and autoregressive models based on neural networks are discussed in detail.


Targeting Bayes factors with direct-path non-equilibrium thermodynamic integration

arXiv.org Machine Learning

Thermodynamic integration (TI) for computing marginal likelihoods is based on an inverse annealing path from the prior to the posterior distribution. In many cases, the resulting estimator suffers from high variability, which particularly stems from the prior regime. When comparing complex models with differences in a comparatively small number of parameters, intrinsic errors from sampling fluctuations may outweigh the differences in the log marginal likelihood estimates. In the present article, we propose a thermodynamic integration scheme that directly targets the log Bayes factor. The method is based on a modified annealing path between the posterior distributions of the two models compared, which systematically avoids the high variance prior regime. We combine this scheme with the concept of non-equilibrium TI to minimise discretisation errors from numerical integration. Results obtained on Bayesian regression models applied to standard benchmark data, and a complex hierarchical model applied to biopathway inference, demonstrate a significant reduction in estimator variance over state-of-the-art TI methods.


Overcoming model simplifications when quantifying predictive uncertainty

arXiv.org Machine Learning

It is generally accepted that all models are wrong -- the difficulty is determining which are useful. Here, a useful model is considered as one that is capable of combining data and expert knowledge, through an inversion or calibration process, to adequately characterize the uncertainty in predictions of interest. This paper derives conditions that specify which simplified models are useful and how they should be calibrated. To start, the notion of an optimal simplification is defined. This relates the model simplifications to the nature of the data and predictions, and determines when a standard probabilistic calibration scheme is capable of accurately characterizing uncertainty. Furthermore, two additional conditions are defined for suboptimal models that determine when the simplifications can be safely ignored. The first allows a suboptimally simplified model to be used in a way that replicates the performance of an optimal model. This is achieved through the judicial selection of a prior term for the calibration process that explicitly includes the nature of the data, predictions and modelling simplifications. The second considers the dependency structure between the predictions and the available data to gain insights into when the simplifications can be overcome by using the right calibration data. Furthermore, the derived conditions are related to the commonly used calibration schemes based on Tikhonov and subspace regularization. To allow concrete insights to be obtained, the analysis is performed under a linear expansion of the model equations and where the predictive uncertainty is characterized via second order moments only.


A Survey of Available Corpora for Building Data-Driven Dialogue Systems

arXiv.org Artificial Intelligence

During the past decade, several areas of speech and language understanding have witnessed substantial breakthroughs from the use of data-driven models. In the area of dialogue systems, the trend is less obvious, and most practical systems are still built through significant engineering and expert knowledge. Nevertheless, several recent results suggest that data-driven approaches are feasible and quite promising. To facilitate research in this area, we have carried out a wide survey of publicly available datasets suitable for data-driven learning of dialogue systems. We discuss important characteristics of these datasets, how they can be used to learn diverse dialogue strategies, and their other potential uses. We also examine methods for transfer learning between datasets and the use of external knowledge. Finally, we discuss appropriate choice of evaluation metrics for the learning objective.


Independence clustering (without a matrix)

arXiv.org Machine Learning

The independence clustering problem is considered in the following formulation: given a set $S$ of random variables, it is required to find the finest partitioning $\{U_1,\dots,U_k\}$ of $S$ into clusters such that the clusters $U_1,\dots,U_k$ are mutually independent. Since mutual independence is the target, pairwise similarity measurements are of no use, and thus traditional clustering algorithms are inapplicable. The distribution of the random variables in $S$ is, in general, unknown, but a sample is available. Thus, the problem is cast in terms of time series. Two forms of sampling are considered: i.i.d.\ and stationary time series, with the main emphasis being on the latter, more general, case. A consistent, computationally tractable algorithm for each of the settings is proposed, and a number of open directions for further research are outlined.


Bayesian Adaptive Data Analysis Guarantees from Subgaussianity

arXiv.org Machine Learning

The new field of adaptive data analysis seeks to provide algorithms and provable guarantees for models of machine learning that allow researchers to reuse their data, which normally falls outside of the usual statistical paradigm of static data analysis. In 2014, Dwork, Feldman, Hardt, Pitassi, Reingold and Roth introduced one potential model and proposed several solutions based on differential privacy. In previous work in 2016, we described a problem with this model and instead proposed a Bayesian variant, but also found that the analogous Bayesian methods cannot achieve the same statistical guarantees as in the static case. In this paper, we prove the first positive results for the Bayesian model, showing that with a Dirichlet prior, the posterior mean algorithm indeed matches the statistical guarantees of the static case. The main ingredient is a new theorem showing that the $\mathrm{Beta}(\alpha,\beta)$ distribution is subgaussian with variance proxy $O(1/(\alpha+\beta+1))$, a concentration result also of independent interest. We provide two proofs of this result: a probabilistic proof utilizing a simple condition for the raw moments of a positive random variable and a learning-theoretic proof based on considering the beta distribution as a posterior, both of which have implications to other related problems.


Challenges in Bayesian Adaptive Data Analysis

arXiv.org Machine Learning

Traditional statistical analysis requires that the analysis process and data are independent. By contrast, the new field of adaptive data analysis hopes to understand and provide algorithms and accuracy guarantees for research as it is commonly performed in practice, as an iterative process of interacting repeatedly with the same data set, such as repeated tests against a holdout set. Previous work has defined a model with a rather strong lower bound on sample complexity in terms of the number of queries, $n\sim\sqrt q$, arguing that adaptive data analysis is much harder than static data analysis, where $n\sim\log q$ is possible. Instead, we argue that those strong lower bounds point to a limitation of the previous model in that it must consider wildly asymmetric scenarios which do not hold in typical applications. To better understand other difficulties of adaptivity, we propose a new Bayesian version of the problem that mandates symmetry. Since the other lower bound techniques are ruled out, we can more effectively see difficulties that might otherwise be overshadowed. As a first contribution to this model, we produce a new problem using error-correcting codes on which a large family of methods, including all previously proposed algorithms, require roughly $n\sim\sqrt[4]q$. These early results illustrate new difficulties in adaptive data analysis regarding slightly correlated queries on problems with concentrated uncertainty.


Finding Statistically Significant Attribute Interactions

arXiv.org Machine Learning

In many data exploration tasks it is meaningful to identify groups of attribute interactions that are specific to a variable of interest. For instance, in a dataset where the attributes are medical markers and the variable of interest (class variable) is binary indicating presence/absence of disease, we would like to know which medical markers interact with respect to the binary class label. These interactions are useful in several practical applications, for example, to gain insight into the structure of the data, in feature selection, and in data anonymisation. We present a novel method, based on statistical significance testing, that can be used to verify if the data set has been created by a given factorised class-conditional joint distribution, where the distribution is parametrised by a partition of its attributes. Furthermore, we provide a method, named astrid, for automatically finding a partition of attributes describing the distribution that has generated the data. State-of-the-art classifiers are utilised to capture the interactions present in the data by systematically breaking attribute interactions and observing the effect of this breaking on classifier performance. We empirically demonstrate the utility of the proposed method with examples using real and synthetic data.


Learning Summary Statistic for Approximate Bayesian Computation via Deep Neural Network

arXiv.org Machine Learning

Approximate Bayesian Computation (ABC) methods are used to approximate posterior distributions in models with unknown or computationally intractable likelihoods. Both the accuracy and computational efficiency of ABC depend on the choice of summary statistic, but outside of special cases where the optimal summary statistics are known, it is unclear which guiding principles can be used to construct effective summary statistics. In this paper we explore the possibility of automating the process of constructing summary statistics by training deep neural networks to predict the parameters from artificially generated data: the resulting summary statistics are approximately posterior means of the parameters. With minimal model-specific tuning, our method constructs summary statistics for the Ising model and the moving-average model, which match or exceed theoretically-motivated summary statistics in terms of the accuracies of the resulting posteriors.