Bayesian Learning
Scalable Bayesian Non-linear Matrix Completion
Qin, Xiangju, Blomstedt, Paul, Kaski, Samuel
Matrix completion aims to predict missing elements in a partially observed data matrix, which in typical applications, such as collaborative filtering, is large and extremely sparsely observed. A standard solution is matrix factorization, which predicts unobserved entries as linear combinations of latent variables. We generalize to nonlinear combinations in massive-scale matrices. Bayesian approaches have proven beneficial in linear matrix completion, but have not been applied in the more general nonlinear case because of limited scalability. We introduce a Bayesian nonlinear matrix completion algorithm, which is based on a recent Bayesian formulation of Gaussian process latent variable models. To address the scalability and computational challenges, we propose a data-parallel distributed computational approach with a restricted communication scheme. We evaluate our method on challenging out-of-matrix prediction tasks using both simulated and real-world data. 1 Introduction: In matrix completion, one of the most widely used approaches for collaborative filtering, the objective is to predict missing elements of a partially observed data matrix.
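The linear baseline described in this abstract, matrix factorization, can be sketched as follows. This is a minimal illustration of plain (non-Bayesian) factorization fitted by gradient descent on the observed entries only, not the authors' nonlinear Bayesian method. The function name `factorize` and all hyperparameters (`rank`, `lr`, `reg`, `epochs`) are assumptions chosen for the example.

```python
import numpy as np

def factorize(R, mask, rank=2, lr=0.01, reg=0.01, epochs=5000, seed=0):
    """Fit R ~ U @ V.T using only the entries where mask == 1.

    Hypothetical helper for illustration: each predicted entry is a
    linear combination of the latent row and column factors.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = R.shape
    # Small random initialisation of the latent factors.
    U = 0.1 * rng.standard_normal((n_rows, rank))
    V = 0.1 * rng.standard_normal((n_cols, rank))
    for _ in range(epochs):
        # Residual on observed entries; unobserved entries contribute zero.
        E = mask * (R - U @ V.T)
        # Gradient steps on the L2-regularised squared error.
        U_step = E @ V - reg * U
        V_step = E.T @ U - reg * V
        U += lr * U_step
        V += lr * V_step
    return U, V

# Toy rank-1 matrix with two entries treated as missing.
R = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
mask = np.ones_like(R)
mask[0, 2] = 0
mask[3, 0] = 0

U, V = factorize(R, mask)
pred = U @ V.T  # predictions for every entry, including the missing ones
```

The nonlinear approach in the abstract replaces the inner product `U @ V.T` with a nonlinear function of the latent variables; that part is not shown here.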
Neural Network based Explicit Mixture Models and Expectation-maximization based Learning
Liu, Dong, Vu, Minh Thành, Chatterjee, Saikat, Rasmussen, Lars K.
We propose two neural network based mixture models in this article. The proposed mixture models are explicit in nature. The explicit models have analytical forms, which bring the advantages of tractable likelihood computation and efficient sample generation. Computation of likelihood is an important aspect of our models. Expectation-maximization based algorithms are developed for learning the parameters of the proposed models. We provide sufficient conditions to realize the expectation-maximization based learning. The main requirements are invertibility of the neural networks that are used as generators and computation of the Jacobian of the functional form of the neural networks. The requirements are practically realized using a flow-based neural network. In our first mixture model, we use multiple flow-based neural networks as generators. Naturally the model is complex. A single latent variable is used as the common input to all the neural networks. The second mixture model uses a single flow-based neural network as a generator to reduce complexity. The single generator has a latent variable input that follows a Gaussian mixture distribution. We demonstrate the efficiency of the proposed mixture models through extensive experiments on sample generation and maximum likelihood based classification.
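The role of invertibility and the Jacobian can be made concrete with the standard change-of-variables formula. The notation below is an assumption introduced for illustration (generators $g_k$ and $g$, mixture weights $\pi_k$, latent density $p_Z$, Gaussian parameters $\mu_k, \Sigma_k$); the paper's own symbols may differ.

```latex
% First model (sketch): K invertible generators g_k share one latent z ~ p_Z.
p(x) = \sum_{k=1}^{K} \pi_k \, p_Z\!\left(g_k^{-1}(x)\right)
       \left| \det \frac{\partial g_k^{-1}(x)}{\partial x} \right|

% Second model (sketch): one invertible generator g, latent z drawn from a Gaussian mixture.
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(g^{-1}(x);\, \mu_k, \Sigma_k\right)
       \left| \det \frac{\partial g^{-1}(x)}{\partial x} \right|
```

In both cases each mixture component has a closed-form density, which is what lets an expectation-maximization procedure compute component responsibilities exactly.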
ML DL AI DS BD - An Introduction
In an image recognition application, the raw input may be a matrix of pixels; the first representational layer may abstract the pixels and encode edges; the second layer may compose and encode arrangements of edges; the third layer may encode a nose and eyes; and the fourth layer may recognize that the image contains a face. Importantly, a deep learning process can learn which features to optimally place in which level on its own.
Multi-agent Inverse Reinforcement Learning for Two-person Zero-sum Games
Lin, Xiaomin, Beling, Peter A., Cogill, Randy
The focus of this paper is a Bayesian framework for solving a class of problems termed multi-agent inverse reinforcement learning (MIRL). Compared to the well-known inverse reinforcement learning (IRL) problem, MIRL is formalized in the context of stochastic games, which generalize Markov decision processes to game theoretic scenarios. We establish a theoretical foundation for competitive two-agent zero-sum MIRL problems and propose a Bayesian solution approach in which the generative model is based on an assumption that the two agents follow a minimax bi-policy. Numerical results are presented comparing the Bayesian MIRL method with two existing methods in the context of an abstract soccer game. Investigation centers on relationships between the extent of prior information and the quality of learned rewards. Results suggest that covariance structure is more important than mean value in reward priors.
Uncertainty in Model-Agnostic Meta-Learning using Variational Inference
Nguyen, Cuong, Do, Thanh-Toan, Carneiro, Gustavo
Thanh-Toan Do, University of Liverpool, thanh-toan.do@liverpool.ac.uk. Abstract: We introduce a new, rigorously-formulated Bayesian meta-learning algorithm that learns a probability distribution of model parameter prior for few-shot learning. The proposed algorithm employs a gradient-based variational inference to infer the posterior of model parameters to a new task. Our algorithm can be applied to any model architecture and can be implemented in various machine learning paradigms, including regression and classification. We show that the models trained with our proposed meta-learning algorithm are well calibrated and accurate, with state-of-the-art calibration and classification results on two few-shot classification benchmarks (Omniglot and Mini-ImageNet), and competitive results in a multi-modal task-distribution regression. 1. Introduction: Machine learning, in particular deep learning, has thrived during the last decade, producing results that were previously considered to be infeasible in several areas. For instance, outstanding results have been achieved in speech and image understanding [1-4], and medical image analysis [5]. However, the development of these machine learning methods typically requires a large number of training samples to achieve notable performance. Such a requirement contrasts with the human ability to adapt quickly to new learning tasks using few "training" samples. This difference may be due to the fact that humans tend to exploit prior knowledge to facilitate the learning of new tasks, while machine learning algorithms often do not use any prior knowledge (e.g., training from scratch with random initialisation) [6] or rely on weak prior knowledge to learn new tasks (e.g., training from pre-trained models) [7]. This challenge has motivated the design of machine learning methods that can make more effective use of prior knowledge to adapt to new learning tasks using few training samples [8].
Variational f-divergence Minimization
Zhang, Mingtian, Bird, Thomas, Habib, Raza, Xu, Tianlin, Barber, David
Probabilistic models are often trained by maximum likelihood, which corresponds to minimizing a specific f-divergence between the model and data distribution. In light of recent successes in training Generative Adversarial Networks, alternative non-likelihood training criteria have been proposed. Whilst not necessarily statistically efficient, these alternatives may better match user requirements such as sharp image generation. A general variational method for training probabilistic latent variable models using maximum likelihood is well established; however, how to train latent variable models using other f-divergences is comparatively unknown. We discuss a variational approach that, when combined with the recently introduced Spread Divergence, can be applied to train a large class of latent variable models using any f-divergence.
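The link between maximum likelihood and f-divergences stated at the start of this abstract can be written out in a few lines. The symbols here ($p$ for the data distribution, $q_\theta$ for the model, $f$ for the generator function) are notation assumed for this sketch.

```latex
% f-divergence for a convex f with f(1) = 0:
D_f(p \,\|\, q_\theta) = \int q_\theta(x)\, f\!\left(\frac{p(x)}{q_\theta(x)}\right) dx

% Choosing f(t) = t \log t recovers the Kullback-Leibler divergence:
D_f(p \,\|\, q_\theta) = \int p(x) \log \frac{p(x)}{q_\theta(x)}\, dx
                       = \mathrm{KL}(p \,\|\, q_\theta)

% The entropy term does not depend on \theta, so
\arg\min_\theta \mathrm{KL}(p \,\|\, q_\theta)
  = \arg\max_\theta \mathbb{E}_{p(x)}\!\left[\log q_\theta(x)\right]
```

Replacing $p$ with the empirical distribution of the training set turns the last line into the usual maximum likelihood objective; other choices of $f$ give the alternative criteria the abstract refers to.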
Multi-turn Dialogue Response Generation with Autoregressive Transformer Models
Olabiyi, Oluwatobi, Mueller, Erik T.
Neural dialogue models, despite their successes, still suffer from lack of relevance, diversity, and in many cases coherence in their generated responses. These issues have been attributed to reasons including (1) short-range model architectures that capture limited temporal dependencies, (2) limitations of the maximum likelihood training objective, (3) the concave entropy profile of dialogue datasets, resulting in short and generic responses, and (4) the out-of-vocabulary problem, leading to generation of a large number of <UNK> tokens. Autoregressive transformer models such as GPT-2, although trained with the maximum likelihood objective, do not suffer from the out-of-vocabulary problem and have demonstrated an excellent ability to capture long-range structures in language modeling tasks. In this paper, we examine the use of autoregressive transformer models for multi-turn dialogue response generation. In our experiments, we employ small and medium GPT-2 models (with publicly available pretrained language model parameters) on the open-domain Movie Triples dataset and the closed-domain Ubuntu Dialogue dataset. The models (with and without pretraining) achieve significant improvements over the baselines for multi-turn dialogue response generation. They also produce state-of-the-art performance on the two datasets based on several metrics, including BLEU, ROUGE, and distinct n-gram.
Adaptively stacking ensembles for influenza forecasting with incomplete data
McAndrew, Thomas, Reich, Nicholas G.
Seasonal influenza infects between 10 and 50 million people in the United States every year, overburdening hospitals during weeks of peak incidence. Named by the CDC as an important tool to fight the damaging effects of these epidemics, accurate forecasts of influenza and influenza-like illness (ILI) forewarn public health officials about when, and where, seasonal influenza outbreaks will hit hardest. Multi-model ensemble forecasts (weighted combinations of component models) have shown positive results in forecasting. Ensemble forecasts of influenza outbreaks have been static, training on all past ILI data at the beginning of a season, generating a set of optimal weights for each model in the ensemble, and keeping the weights constant. We propose an adaptive ensemble forecast that (i) changes model weights week-by-week throughout the influenza season, (ii) only needs the current influenza season's data to make predictions, and (iii) by introducing a prior distribution, shrinks weights toward the reference equal weighting approach and adjusts for observed ILI percentages that are subject to future revisions. We investigate the prior's ability to impact adaptive ensemble performance and, after finding an optimal prior via a cross-validation approach, compare our adaptive ensemble's performance to equal-weighted and static ensembles. Applied to forecasts of short-term ILI incidence at the regional and national level in the US, our adaptive model outperforms a naive equal-weighted ensemble, and performs similarly to or better than the static ensemble, which requires multiple years of training data. Adaptive ensembles are able to quickly train and forecast during epidemics, and provide a practical tool to public health officials looking for forecasts that can conform to unique features of a specific season.
Bayesian Robustness: A Nonasymptotic Viewpoint
Bhatia, Kush, Ma, Yi-An, Dragan, Anca D., Bartlett, Peter L., Jordan, Michael I.
The goal is to capture the sensitivity of inferential procedures to the presence of outliers in the data and misspecifications in the modelling assumptions, and to mitigate overly large sensitivity. The Bayesian approach has been focused on capturing possible anomalies in the observed data via the model and in choosing priors that have minimal effect on inferences. The frequentist approach, on the other hand, has focused on the development of estimators that identify and guard against outliers in the data. We refer the reader to [Hub11, Chap 15] for a comprehensive discussion.
von Neumann-Morgenstern and Savage Theorems for Causal Decision Making
Gonzalez-Soto, Mauricio, Sucar, Luis E., Escalante, Hugo J.
Decision making under uncertain conditions has been well studied when uncertainty can only be considered at the associative level of information. The classical Theorems of von Neumann-Morgenstern and Savage provide a formal criterion for rationally making choices using associative information. We present here a previous result from Pearl and show that it can be considered as a causal version of the von Neumann-Morgenstern Theorem; furthermore, we consider the case when the true causal mechanism that controls the environment is unknown to the decision maker and propose a causal version of the Savage Theorem. As applications, we argue that previous optimal action learning methods for causal environments fit within the Causal Savage Theorem we present, thus showing the utility of our result in the justification and design of learning algorithms; furthermore, we define a Causal Nash Equilibrium for a strategic game in a causal environment in terms of the preferences induced by our Causal Decision Making Theorem.