Bayesian Inference
Bayes-Optimal Entropy Pursuit for Active Choice-Based Preference Learning
Pallone, Stephen N., Frazier, Peter I., Henderson, Shane G.
We analyze the problem of learning a single user's preferences in an active learning setting, sequentially and adaptively querying the user over a finite time horizon. Learning is conducted via choice-based queries, where the user selects her preferred option among a small subset of offered alternatives. These queries have been shown to be a robust and efficient way to learn an individual's preferences. We take a parametric approach and model the user's preferences through a linear classifier, using a Bayesian prior to encode our current knowledge of this classifier. The rate at which we learn depends on the alternatives offered at every time epoch. Under certain noise assumptions, we show that the Bayes-optimal policy for maximally reducing entropy of the posterior distribution of this linear classifier is a greedy policy, and that this policy achieves a linear lower bound when alternatives can be constructed from the continuum. Further, we analyze a different metric called misclassification error, proving that the performance of the optimal policy that minimizes misclassification error is bounded below by a linear function of differential entropy. Lastly, we numerically compare the greedy entropy reduction policy with a knowledge gradient policy under a number of scenarios, examining their performance under both differential entropy and misclassification error.
Control of Gene Regulatory Networks with Noisy Measurements and Uncertain Inputs
Imani, Mahdi, Braga-Neto, Ulisses
This paper is concerned with the problem of stochastic control of gene regulatory networks (GRNs) observed indirectly through noisy measurements and with uncertainty in the intervention inputs. The partial observability of the gene states and uncertainty in the intervention process are accounted for by modeling GRNs using the partially-observed Boolean dynamical system (POBDS) signal model with noisy gene expression measurements. Obtaining the optimal infinite-horizon control strategy for this problem is not attainable in general, and we apply reinforcement learning and Gaussian process techniques to find a near-optimal solution. The POBDS is first transformed to a directly-observed Markov Decision Process in a continuous belief space, and the Gaussian process is used for modeling the cost function over the belief and intervention spaces. Reinforcement learning then is used to learn the cost function from the available gene expression data. In addition, we employ sparsification, which enables the control of large partially-observed GRNs. The performance of the resulting algorithm is studied through a comprehensive set of numerical experiments using synthetic gene expression data generated from a melanoma gene regulatory network.
Scalable Inference for Nested Chinese Restaurant Process Topic Models
Chen, Jianfei, Zhu, Jun, Lu, Jie, Liu, Shixia
Nested Chinese Restaurant Process (nCRP) topic models are powerful nonparametric Bayesian methods to extract a topic hierarchy from a given text corpus, where the hierarchical structure is automatically determined by the data. Hierarchical Latent Dirichlet Allocation (hLDA) is a popular instance of nCRP topic models. However, hLDA has only been evaluated at small scale, because the existing collapsed Gibbs sampling and instantiated weight variational inference algorithms either are not scalable or sacrifice inference quality with mean-field assumptions. Moreover, an efficient distributed implementation of the data structures, such as dynamically growing count matrices and trees, is challenging. In this paper, we propose a novel partially collapsed Gibbs sampling (PCGS) algorithm, which combines the advantages of collapsed and instantiated weight algorithms to achieve good scalability as well as high model quality. An initialization strategy is presented to further improve the model quality. Finally, we propose an efficient distributed implementation of PCGS through vectorization, pre-processing, and a careful design of the concurrent data structures and communication strategy. Empirical studies show that our algorithm is 111 times more efficient than the previous open-source implementation for hLDA, with comparable or even better model quality. Our distributed implementation can extract 1,722 topics from a 131-million-document corpus with 28 billion tokens, which is 4-5 orders of magnitude larger than the previous largest corpus, with 50 machines in 7 hours.
The Mathematics of Machine Learning
In the last few months, I have had several people contact me about their enthusiasm for venturing into the world of data science and using Machine Learning (ML) techniques to probe statistical regularities and build impeccable data-driven products. However, I've observed that some actually lack the necessary mathematical intuition and framework to get useful results. This is the main reason I decided to write this blog post. Recently, there has been an upsurge in the availability of many easy-to-use machine and deep learning packages such as scikit-learn, Weka, Tensorflow etc. Machine Learning theory is a field that intersects statistical, probabilistic, computer science and algorithmic aspects arising from learning iteratively from data and finding hidden insights which can be used to build intelligent applications. Despite the immense possibilities of Machine and Deep Learning, a thorough mathematical understanding of many of these techniques is necessary for a good grasp of the inner workings of the algorithms and getting good results. There are many reasons why the mathematics of Machine Learning is important and I'll highlight some of them below: The main question when trying to understand an interdisciplinary field such as Machine Learning is the amount of maths necessary and the level of maths needed to understand these techniques.
Social Learning and Diffusion of Pervasive Goods: An Empirical Study of an African App Store
Nia, Meisam Hejazi, Ratchford, Brian T., Bruce, Norris
In this study, the authors develop a structural model that combines a macro diffusion model with a micro choice model to control for the effect of social influence on the mobile app choices of customers over app stores. Social influence refers to the density of adopters within the proximity of other customers. Using a large data set from an African app store and Bayesian estimation methods, the authors quantify the effect of social influence and investigate the impact of ignoring this process in estimating customer choices. The findings show that customer choices in the app store are explained better by offline than online density of adopters and that ignoring social influence in estimations results in biased estimates. Furthermore, the findings show that the mobile app adoption process is similar to adoption of music CDs, among all other classic economy goods. A counterfactual analysis shows that the app store can increase its revenue by 13.6% through a viral marketing policy (e.g., a sharing with friends and family button).
Amortised MAP Inference for Image Super-resolution
Sรธnderby, Casper Kaae, Caballero, Jose, Theis, Lucas, Shi, Wenzhe, Huszรกr, Ferenc
Image super-resolution (SR) is an underdetermined inverse problem, where a large number of plausible high-resolution images can explain the same downsampled image. Most current single image SR methods use empirical risk minimisation, often with a pixel-wise mean squared error (MSE) loss. However, the outputs from such methods tend to be blurry, over-smoothed and generally appear implausible. A more desirable approach would employ Maximum a Posteriori (MAP) inference, preferring solutions that always have a high probability under the image prior, and thus appear more plausible. Direct MAP estimation for SR is non-trivial, as it requires us to build a model for the image prior from samples. Furthermore, MAP inference is often performed via optimisation-based iterative algorithms which don't compare well with the efficiency of neural-network-based alternatives. Here we introduce new methods for amortised MAP inference whereby we calculate the MAP estimate directly using a convolutional neural network. We first introduce a novel neural network architecture that performs a projection to the affine subspace of valid SR solutions ensuring that the high resolution output of the network is always consistent with the low resolution input. We show that, using this architecture, the amortised MAP inference problem reduces to minimising the cross-entropy between two distributions, similar to training generative models. We propose three methods to solve this optimisation problem: (1) Generative Adversarial Networks (GAN) (2) denoiser-guided SR which backpropagates gradient-estimates from denoising to train the network, and (3) a baseline method using a maximum-likelihood-trained image prior. Our experiments show that the GAN based approach performs best on real image data. Lastly, we establish a connection between GANs and amortised variational inference as in e.g. variational autoencoders.
On the Sample Complexity of Learning Graphical Games
We analyze the sample complexity of learning graphical games from purely behavioral data. We assume that we can only observe the players' joint actions and not their payoffs. We analyze the sufficient and necessary number of samples for the correct recovery of the set of pure-strategy Nash equilibria (PSNE) of the true game. Our analysis focuses on directed graphs with $n$ nodes and at most $k$ parents per node. Sparse graphs correspond to ${k \in O(1)}$ with respect to $n$, while dense graphs correspond to ${k \in O(n)}$. By using VC dimension arguments, we show that if the number of samples is greater than ${O(k n \log^2{n})}$ for sparse graphs or ${O(n^2 \log{n})}$ for dense graphs, then maximum likelihood estimation correctly recovers the PSNE with high probability. By using information-theoretic arguments, we show that if the number of samples is less than ${\Omega(k n \log^2{n})}$ for sparse graphs or ${\Omega(n^2 \log{n})}$ for dense graphs, then any conceivable method fails to recover the PSNE with arbitrary probability.
Variational Boosting: Iteratively Refining Posterior Approximations
Miller, Andrew C., Foti, Nicholas, Adams, Ryan P.
We propose a black-box variational inference method to approximate intractable distributions with an increasingly rich approximating class. Our method, termed variational boosting, iteratively refines an existing variational approximation by solving a sequence of optimization problems, allowing the practitioner to trade computation time for accuracy. We show how to expand the variational approximating class by incorporating additional covariance structure and by introducing new components to form a mixture. We apply variational boosting to synthetic and real statistical models, and show that resulting posterior inferences compare favorably to existing posterior approximation algorithms in both accuracy and efficiency.
Approximate Bayes learning of stochastic differential equations
Batz, Philipp, Ruttor, Andreas, Opper, Manfred
Gaussian processes are used as flexible models for these functions and estimates are calculated directly from dense data sets using Gaussian process regression. We also develop an approximate expectation maximization algorithm to deal with the unobserved, latent dynamics between sparse observations. The posterior over states is approximated by a piecewise linearized process of the Ornstein-Uhlenbeck type and the maximum a posteriori estimation of the drift is facilitated by a sparse Gaussian process approximation. I. INTRODUCTION Dynamical systems in the physical world evolve in continuous time and often the (noisy) dynamics is described naturally in terms of (stochastic) differential equations [1]. However, due to missing information and/or the complexity of a system it may be difficult to derive such a model from first principles. Instead, the goal often is to fit it to observations of the state at discrete points in time [2]. So far most inference approaches for these systems have dealt with the estimation of parameters contained in the drift function (e.g. Assumptions for the stochastic part were often simple: additive noise with the diffusion constant as the only parameter to estimate. But as both drift and diffusion can be nonlinear functions of the state vector, a nonparametric estimation would be a natural generalization, when a large number of data points is available. Previous nonparametric approaches were based on solving the adjoint Fokker-Planck equation [5] and on kernel estimators [6] and are effectively restricted to one-dimensional models. An alternative would be a Bayesian nonparametric approach, where prior knowledge on the unknown functions--such as smoothness, variability, or periodicity--can be encoded in a probability distribution. A recent result by [7, 8] presented an important step in this direction. The authors have shown that Gaussian processes (GPs) provide a natural family of prior probability measures over drift functions. If a path of the stochastic dynamics is observed densely, the posterior process over the drift is also a GP. Unfortunately, this simplicity is lost, when observations are not dense, but separated by larger time intervals. In [7] the case of sparse observations has been treated by a Monte Carlo approach, which alternates between sampling complete diffusion paths of the stochastic differential equation (SDE) and sampling from GP for the drift given a philipp.batz@tu-berlin.de
The Mathematics of Machine Learning
In the last few months, I have had several people contact me about their enthusiasm for venturing into the world of data science and using Machine Learning (ML) techniques to probe statistical regularities and build impeccable data-driven products. However, I've observed that some actually lack the necessary mathematical intuition and framework to get useful results. This is the main reason I decided to write this blog post. Recently, there has been an upsurge in the availability of many easy-to-use machine and deep learning packages such as scikit-learn, Weka, Tensorflow etc. Machine Learning theory is a field that intersects statistical, probabilistic, computer science and algorithmic aspects arising from learning iteratively from data and finding hidden insights which can be used to build intelligent applications. Despite the immense possibilities of Machine and Deep Learning, a thorough mathematical understanding of many of these techniques is necessary for a good grasp of the inner workings of the algorithms and getting good results. There are many reasons why the mathematics of Machine Learning is important and I'll highlight some of them below: The main question when trying to understand an interdisciplinary field such as Machine Learning is the amount of maths necessary and the level of maths needed to understand these techniques.