Bayesian Inference
Hyperparameter Learning for Conditional Kernel Mean Embeddings with Rademacher Complexity Bounds
Hsu, Kelvin, Nock, Richard, Ramos, Fabio
Conditional kernel mean embeddings are nonparametric models that encode conditional expectations in a reproducing kernel Hilbert space. While they provide a flexible and powerful framework for probabilistic inference, their performance is highly dependent on the choice of kernel and regularization hyperparameters. Nevertheless, current hyperparameter tuning methods predominantly rely on expensive cross validation or heuristics that is not optimized for the inference task. For conditional kernel mean embeddings with categorical targets and arbitrary inputs, we propose a hyperparameter learning framework based on Rademacher complexity bounds to prevent overfitting by balancing data fit against model complexity. Our approach only requires batch updates, allowing scalable kernel hyperparameter tuning without invoking kernel approximations. Experiments demonstrate that our learning framework outperforms competing methods, and can be further extended to incorporate and learn deep neural network weights to improve generalization.
Computing the Value of Computation for Planning
An intelligent agent performs actions in order to achieve its goals. Such actions can either be externally directed, such as opening a door, or internally directed, such as writing data to a memory location or strengthening a synaptic connection. Some internal actions, to which we refer as computations, potentially help the agent choose better actions. Considering that (external) actions and computations might draw upon the same resources, such as time and energy, deciding when to act or compute, as well as what to compute, are detrimental to the performance of an agent. In an environment that provides rewards depending on an agent's behavior, an action's value is typically defined as the sum of expected long-term rewards succeeding the action (itself a complex quantity that depends on what the agent goes on to do after the action in question). However, defining the value of a computation is not as straightforward, as computations are only valuable in a higher order way, through the alteration of actions. This thesis offers a principled way of computing the value of a computation in a planning setting formalized as a Markov decision process. We present two different definitions of computation values: static and dynamic. They address two extreme cases of the computation budget: affording calculation of zero or infinitely many steps in the future. We show that these values have desirable properties, such as temporal consistency and asymptotic convergence. Furthermore, we propose methods for efficiently computing and approximating the static and dynamic computation values. We describe a sense in which the policies that greedily maximize these values can be optimal. We utilize these principles to construct Monte Carlo tree search algorithms that outperform most of the state-of-the-art in terms of finding higher quality actions given the same simulation resources.
Multi-channel discourse as an indicator for Bitcoin price and volume movements
This research aims to identify how Bitcoin-related news publications and online discourse are expressed in Bitcoin exchange movements of price and volume. Being inherently digital, all Bitcoin-related fundamental data (from exchanges, as well as transactional data directly from the blockchain) is available online, something that is not true for traditional businesses or currencies traded on exchanges. This makes Bitcoin an interesting subject for such research, as it enables the mapping of sentiment to fundamental events that might otherwise be inaccessible. Furthermore, Bitcoin discussion largely takes place on online forums and chat channels. In stock trading, the value of sentiment data in trading decisions has been demonstrated numerous times [1] [2] [3], and this research aims to determine whether there is value in such data for Bitcoin trading models. To achieve this, data over the year 2015 has been collected from Bitcointalk.org, (the biggest Bitcoin forum in post volume), established news sources such as Bloomberg and the Wall Street Journal, the complete /r/btc and /r/Bitcoin subreddits, and the bitcoin-otc and bitcoin-dev IRC channels. By analyzing this data on sentiment and volume, we find weak to moderate correlations between forum, news, and Reddit sentiment and movements in price and volume from 1 to 5 days after the sentiment was expressed. A Granger causality test confirms the predictive causality of the sentiment on the daily percentage price and volume movements, and at the same time underscores the predictive causality of market movements on sentiment expressions in online communities
Distributionally Robust Graphical Models
Fathony, Rizal, Rezaei, Ashkan, Bashiri, Mohammad Ali, Zhang, Xinhua, Ziebart, Brian D.
In many structured prediction problems, complex relationships between variables are compactly defined using graphical structures. The most prevalent graphical prediction methods---probabilistic graphical models and large margin methods---have their own distinct strengths but also possess significant drawbacks. Conditional random fields (CRFs) are Fisher consistent, but they do not permit integration of customized loss metrics into their learning process. Large-margin models, such as structured support vector machines (SSVMs), have the flexibility to incorporate customized loss metrics, but lack Fisher consistency guarantees. We present adversarial graphical models (AGM), a distributionally robust approach for constructing a predictor that performs robustly for a class of data distributions defined using a graphical structure. Our approach enjoys both the flexibility of incorporating customized loss metrics into its design as well as the statistical guarantee of Fisher consistency. We present exact learning and prediction algorithms for AGM with time complexity similar to existing graphical models and show the practical benefits of our approach with experiments.
Concept Learning with Energy-Based Models
Many hallmarks of human intelligence, such as generalizing from limited experience, abstract reasoning and planning, analogical reasoning, creative problem solving, and capacity for language require the ability to consolidate experience into concepts, which act as basic building blocks of understanding and reasoning. We present a framework that defines a concept by an energy function over events in the environment, as well as an attention mask over entities participating in the event. Given few demonstration events, our method uses inference-time optimization procedure to generate events involving similar concepts or identify entities involved in the concept. We evaluate our framework on learning visual, quantitative, relational, temporal concepts from demonstration events in an unsupervised manner. Our approach is able to successfully generate and identify concepts in a few-shot setting and resulting learned concepts can be reused across environments. Example videos of our results are available at sites.google.com/site/energyconceptmodels
TzK Flow - Conditional Generative Model
We introduce TzK (pronounced "task"), a conditional flow-based encoder/decoder generative model, formulated in terms of maximum likelihood (ML). TzK offers efficient approximation of arbitrary data sample distributions (similar to GAN and flow-based ML), and stable training (similar to VAE and ML), while avoiding variational approximations (similar to ML). TzK exploits meta-data to facilitate a bottleneck, similar to autoencoders, thereby producing a low-dimensional representation. Unlike autoencoders, our bottleneck does not limit model expressiveness, similar to flow-based ML. Supervised, unsupervised, and semi-supervised learning are supported by replacing missing observations with samples from learned priors. We demonstrate TzK by jointly training on MNIST and Omniglot with minimal preprocessing, and weak supervision, with results which are comparable to state-of-the-art.
Low-Rank Phase Retrieval via Variational Bayesian Learning
Liu, Kaihui, Wang, Jiayi, Xing, Zhengli, Yang, Linxiao, Fang, Jun
Abstract--In this paper, we consider the problem of low-rank phase retrieval whose objective is to estimate a complex low-rank matrix from magnitude-only measurements. We propose a hierarchical prior model for low-rank phase retrieval, in which a Gaussian-Wishart hierarchical prior is placed on the underlying low-rank matrix to promote the low-rankness of the matrix. Based on the proposed hierarchical model, a variational expectation-maximization (EM) algorithm is developed. The proposed method is less sensitive to the choice of the initialization point and works well with random initialization. Simulation results are provided to illustrate the effectiveness of the proposed algorithm.
Bayesian Action Decoder for Deep Multi-Agent Reinforcement Learning
Foerster, Jakob N., Song, Francis, Hughes, Edward, Burch, Neil, Dunning, Iain, Whiteson, Shimon, Botvinick, Matthew, Bowling, Michael
When observing the actions of others, humans carry out inferences about why the others acted as they did, and what this implies about their view of the world. Humans also use the fact that their actions will be interpreted in this manner when observed by others, allowing them to act informatively and thereby communicate efficiently with others. Although learning algorithms have recently achieved superhuman performance in a number of two-player, zero-sum games, scalable multi-agent reinforcement learning algorithms that can discover effective strategies and conventions in complex, partially observable settings have proven elusive. We present the Bayesian action decoder (BAD), a new multi-agent learning method that uses an approximate Bayesian update to obtain a public belief that conditions on the actions taken by all agents in the environment. Together with the public belief, this Bayesian update effectively defines a new Markov decision process, the public belief MDP, in which the action space consists of deterministic partial policies, parameterised by deep neural networks, that can be sampled for a given public state. It exploits the fact that an agent acting only on this public belief state can still learn to use its private information if the action space is augmented to be over partial policies mapping private information into environment actions. The Bayesian update is also closely related to the theory of mind reasoning that humans carry out when observing others' actions. We first validate BAD on a proof-of-principle two-step matrix game, where it outperforms traditional policy gradient methods. We then evaluate BAD on the challenging, cooperative partial-information card game Hanabi, where in the two-player setting the method surpasses all previously published learning and hand-coded approaches.
Variational Bayes Inference in Digital Receivers
The digital telecommunications receiver is an important context for inference methodology, the key objective being to minimize the expected loss function in recovering the transmitted information. For that criterion, the optimal decision is the Bayesian minimum-risk estimator. However, the computational load of the Bayesian estimator is often prohibitive and, hence, efficient computational schemes are required. The design of novel schemes, striking new balances between accuracy and computational load, is the primary concern of this thesis. Two popular techniques, one exact and one approximate, will be studied. The exact scheme is a recursive one, namely the generalized distributive law (GDL), whose purpose is to distribute all operators across the conditionally independent (CI) factors of the joint model, so as to reduce the total number of operators required. In a novel theorem derived in this thesis, GDL, if applicable, will be shown to guarantee such a reduction in all cases. An associated lemma also quantifies this reduction. For practical use, two novel algorithms, namely the no-longer-needed (NLN) algorithm and the generalized form of the Markovian Forward-Backward (FB) algorithm, recursively factorizes and computes the CI factors of an arbitrary model, respectively. The approximate scheme is an iterative one, namely the Variational Bayes (VB) approximation, whose purpose is to find the independent (i.e. zero-order Markov) model closest to the true joint model in the minimum Kullback-Leibler divergence (KLD) sense. Despite being computationally efficient, this naive mean field approximation confers only modest performance for highly correlated models. A novel approximation, namely Transformed Variational Bayes (TVB), will be designed in the thesis in order to relax the zero-order constraint in the VB approximation, further reducing the KLD of the optimal approximation.
Large-scale Heteroscedastic Regression via Gaussian Process
Liu, Haitao, Ong, Yew-Soon, Cai, Jianfei
Heteroscedastic regression which considers varying noises across input domain has many applications in fields like machine learning and statistics. Here we focus on the heteroscedastic Gaussian process (HGP) regression which integrates the latent function and the noise together in a unified non-parametric Bayesian framework. Though showing flexible and powerful performance, HGP suffers from the cubic time complexity, which strictly limits its application to big data. To improve the scalability of HGP, we first develop a variational sparse inference algorithm, named VSHGP, to handle large-scale datasets. Furthermore, to enhance the model capability of capturing quick-varying features, we follow the Bayesian committee machine (BCM) formalism to distribute the learning over multiple local VSHGP experts with many inducing points, and aggregate their predictive distributions. The proposed distributed VSHGP (DVSHGP) (i) enables large-scale HGP regression via distributed computations, and (ii) achieves high model capability via localized experts and many inducing points. Superiority of the proposed DVSHGP as compared to existing large-scale heteroscedastic/homoscedastic GPs is then verified using a synthetic dataset and three real-world datasets.