Bayesian Inference
A Full Probabilistic Model for Yes/No Type Crowdsourcing in Multi-Class Classification
Saldias-Fuentes, Belen, Protopapas, Pavlos, Pichara, Karim
Crowdsourcing has become widely used in supervised scenarios where training sets are scarce and difficult to obtain. Most crowdsourcing models in the literature assume labelers can provide answers to full questions. In classification contexts, full questions require a labeler to discern among all possible classes. Unfortunately, discernment is not always easy in realistic scenarios. Labelers may not be experts in differentiating all classes. In this work, we provide a full probabilistic model for a shorter type of queries. Our shorter queries only require "yes" or "no" responses. Our model estimates a joint posterior distribution of matrices related to labelers' confusions and the posterior probability of the class of every object. We developed an approximate inference approach, using Monte Carlo Sampling and Black Box Variational Inference, which provides the derivation of the necessary gradients. We built two realistic crowdsourcing scenarios to test our model. The first scenario queries for irregular astronomical time-series. The second scenario relies on the image classification of animals. We achieved results that are comparable with those of full query crowdsourcing. Furthermore, we show that modeling labelers' failures plays an important role in estimating true classes. Finally, we provide the community with two real datasets obtained from our crowdsourcing experiments. All our code is publicly available.
Positions for Exceptional Doctoral Students (deadline January 31, 2019)
The Helsinki Doctoral Education Network in Information and Communications Technology (HICT) is a joint initiative by Aalto University and the University of Helsinki, the two leading universities within this area in Finland. The network involves at present over 60 professors and over 200 doctoral students, and the participating units graduate altogether more than 40 new doctors each year. The quality of research and education in both HICT universities is world-class, and the education is practically free as there are no tuition fees for doctoral students in the Finnish university system. In terms of the living environment, Helsinki has been ranked as one of the world's top-10 most livable cities (Economist, 2017), and Finland is among the best countries in the world with respect to many quality of life indicators, including being the overall #1 country in human wellbeing. Helsinki is in the second place in the world's startup city comparison (Valuer, 2018) and is also the Mobile Data Capital of the World (IEEE Spectrum, 2018). The participating units of HICT have currently funding available for exceptionally qualified doctoral students. We offer the possibility to join world-class research groups, with multiple interesting research projects to choose from. If you wish to be considered as a potential new doctoral student in HICT you can apply to one or a number of doctoral student positions (listed below).
Variational Characterizations of Local Entropy and Heat Regularization in Deep Learning
Trillos, Nicolas Garcia, Kaplan, Zach, Sanz-Alonso, Daniel
The aim of this paper is to provide new theoretical and computational understanding on two loss regularizations employed in deep learning, known as local entropy and heat regularization. For both regularized losses we introduce variational characterizations that naturally suggest a two-step scheme for their optimization, based on the iterative shift of a probability density and the calculation of a best Gaussian approximation in Kullback-Leibler divergence. Under this unified light, the optimization schemes for local entropy and heat regularized loss differ only over which argument of the Kullback-Leibler divergence is used to find the best Gaussian approximation. Local entropy corresponds to minimizing over the second argument, and the solution is given by moment matching. This allows to replace traditional back-propagation calculation of gradients by sampling algorithms, opening an avenue for gradient-free, parallelizable training of neural networks.
Testing Conditional Predictive Independence in Supervised Learning Algorithms
Watson, David S., Wright, Marvin N.
We propose a general test of conditional independence. The conditional predictive impact (CPI) is a provably consistent and unbiased estimator of one or several features' association with a given outcome, conditional on a (potentially empty) reduced feature set. The measure can be calculated using any supervised learning algorithm and loss function. It relies on no parametric assumptions and applies equally well to continuous and categorical predictors and outcomes. The CPI can be efficiently computed for low- or high-dimensional data without any sparsity constraints. We illustrate PAC-Bayesian convergence rates for the CPI and develop statistical inference procedures for evaluating its magnitude, significance, and precision. These tests aid in feature and model selection, extending traditional frequentist and Bayesian techniques to general supervised learning tasks. The CPI may also be used in conjunction with causal discovery algorithms to identify underlying graph structures for multivariate systems. We test our method in conjunction with various algorithms, including linear regression, neural networks, random forests, and support vector machines. Empirical results show that the CPI compares favorably to alternative variable importance measures and other nonparametric tests of conditional independence on a diverse array of real and simulated datasets. Simulations confirm that our inference procedures successfully control Type I error and achieve nominal coverage probability. Our method has been implemented in an R package, cpi, which can be downloaded from https://github.com/dswatson/cpi.
Improved Accounting for Differentially Private Learning
Triastcyn, Aleksei, Faltings, Boi
We consider the problem of differential privacy accounting, i.e. estimation of privacy loss bounds, in machine learning in a broad sense. We propose two versions of a generic privacy accountant suitable for a wide range of learning algorithms. Both versions are derived in a simple and principled way using well-known tools from probability theory, such as concentration inequalities. We demonstrate that our privacy accountant is able to achieve state-of-the-art estimates of DP guarantees and can be applied to new areas like variational inference. Moreover, we show that the latter enjoys differential privacy at minor cost.
Normalized Flat Minima: Exploring Scale Invariant Definition of Flat Minima for Neural Networks using PAC-Bayesian Analysis
Tsuzuku, Yusuke, Sato, Issei, Sugiyama, Masashi
The notion of flat minima has played a key role in the generalization studies of deep learning models. However, existing definitions of the flatness are known to be sensitive to the rescaling of parameters. The issue suggests that the previous definitions of the flatness might not be a good measure of generalization, because generalization is invariant to such rescalings. In this paper, from the PAC-Bayesian perspective, we scrutinize the discussion concerning the flat minima and introduce the notion of normalized flat minima, which is free from the known scale dependence issues. Additionally, we highlight the scale dependence of existing matrix-norm based generalization error bounds similar to the existing flat minima definitions. Our modified notion of the flatness does not suffer from the insufficiency, either, suggesting it might provide better hierarchy in the hypothesis class.
The CM Algorithm for the Maximum Mutual Information Classifications of Unseen Instances
The Maximum Mutual Information (MMI) criterion is different from the Least Error Rate (LER) criterion. It can reduce failing to report small probability events. This paper introduces the Channels Matching (CM) algorithm for the MMI classifications of unseen instances. It also introduces some semantic information methods, which base the CM algorithm. In the CM algorithm, label learning is to let the semantic channel match the Shannon channel (Matching I) whereas classifying is to let the Shannon channel match the semantic channel (Matching II). We can achieve the MMI classifications by repeating Matching I and II. For low-dimensional feature spaces, we only use parameters to construct n likelihood functions for n different classes (rather than to construct partitioning boundaries as gradient descent) and expresses the boundaries by numerical values. Without searching in parameter spaces, the computation of the CM algorithm for low-dimensional feature spaces is very simple and fast. Using a two-dimensional example, we test the speed and reliability of the CM algorithm by different initial partitions. For most initial partitions, two iterations can make the mutual information surpass 99% of the convergent MMI. The analysis indicates that for high-dimensional feature spaces, we may combine the CM algorithm with neural networks to improve the MMI classifications for faster and more reliable convergence.
Bayesian Learning of Neural Network Architectures
Dikov, Georgi, van der Smagt, Patrick, Bayer, Justin
In this paper we propose a Bayesian method for estimating architectural parameters of neural networks, namely layer size and network depth. We do this by learning concrete distributions over these parameters. Our results show that regular networks with a learnt structure can generalise better on small datasets, while fully stochastic networks can be more robust to parameter initialisation. The proposed method relies on standard neural variational learning and, unlike randomised architecture search, does not require a retraining of the model, thus keeping the computational overhead at minimum.
Markov Properties of Discrete Determinantal Point Processes
Sadeghi, Kayvan, Rinaldo, Alessandro
Determinantal point processes (DPPs) are probabilistic models for repulsion. When used to represent the occurrence of random subsets of a finite base set, DPPs allow to model global negative associations in a mathematically elegant and direct way. Discrete DPPs have become popular and computationally tractable models for solving several machine learning tasks that require the selection of diverse objects, and have been successfully applied in numerous real-life problems. Despite their popularity, the statistical properties of such models have not been adequately explored. In this note, we derive the Markov properties of discrete DPPs and show how they can be expressed using graphical models.
Bayesian surrogate learning in dynamic simulator-based regression problems
The estimation of unknown values of parameters (or hidden variables, control variables) that characterise a physical system often relies on the comparison of measured data with synthetic data produced by some numerical simulator of the system as the parameter values are varied. This process often encounters two major difficulties: the generation of synthetic data for each considered set of parameter values can be computationally expensive if the system model is complicated; and the exploration of the parameter space can be inefficient and/or incomplete, a typical example being when the exploration becomes trapped in a local optimum of the objection function that characterises the mismatch between the measured and synthetic data. A method to address both these issues is presented, whereby: a surrogate model (or proxy), which emulates the computationally expensive system simulator, is constructed using deep recurrent networks (DRN); and a nested sampling (NS) algorithm is employed to perform efficient and robust exploration of the parameter space. The analysis is performed in a Bayesian context, in which the samples characterise the full joint posterior distribution of the parameters, from which parameter estimates and uncertainties are easily derived. The proposed approach is compared with conventional methods in some numerical examples, for which the results demonstrate that one can accelerate the parameter estimation process by at least an order of magnitude.