Bayesian Inference
Tuning Hyperparameters without Grad Students: Scalable and Robust Bayesian Optimisation with Dragonfly
Kandasamy, Kirthevasan, Vysyaraju, Karun Raju, Neiswanger, Willie, Paria, Biswajit, Collins, Christopher R., Schneider, Jeff, Poczos, Barnabas, Xing, Eric P.
Bayesian Optimisation (BO), refers to a suite of techniques for global optimisation of expensive black box functions, which use introspective Bayesian models of the function to efficiently find the optimum. While BO has been applied successfully in many applications, modern optimisation tasks usher in new challenges where conventional methods fail spectacularly. In this work, we present Dragonfly, an open source Python library for scalable and robust BO. Dragonfly incorporates multiple recently developed methods that allow BO to be applied in challenging real world settings; these include better methods for handling higher dimensional domains, methods for handling multi-fidelity evaluations when cheap approximations of an expensive function are available, methods for optimising over structured combinatorial spaces, such as the space of neural network architectures, and methods for handling parallel evaluations. Additionally, we develop new methodological improvements in BO for selecting the Bayesian model, selecting the acquisition function, and optimising over complex domains with different variable types and additional constraints. We compare Dragonfly to a suite of other packages and algorithms for global optimisation and demonstrate that when the above methods are integrated, they enable significant improvements in the performance of BO. The Dragonfly library is available at dragonfly.github.io.
Modeling Complementary Products and Customer Preferences with Context Knowledge for Online Recommendation
Xu, Da, Ruan, Chuanwei, Korpeoglu, Evren, Kumar, Sushant, Achan, Kannan
Modeling item complementariness and user preferences from purchase data is essential for learning good representations of products and customers, which empowers the modern personalized recommender system for Walmart's e-commerce platform. The intrinsic complementary relationship among products captures the buy-also-buy patterns and provides great sources for recommendations. Product complementary patterns, though often reflected by population purchase behaviors, are not separable from customer-specific bias in purchase data. We propose a unified model with Bayesian network structure that takes account of both factors. In the meantime, we merge the contextual knowledge of both products and customers into their representations. We also use the dual product embeddings to capture the intrinsic properties of complementariness, such as asymmetry. The separating hyperplane theory sheds light on the geometric interpretation of using the additional embedding. We conduct extensive evaluations on our model before final production, and propose a novel ranking criterion based on product and customer embeddings. Our method compares favorably to existing approaches in various offline and online testings, and case studies demonstrate the advantage and usefulness of the dual product embeddings as well as the user embeddings.
Deep Switch Networks for Generating Discrete Data and Language
Delgosha, Payam, Goela, Naveen
Multilayer switch networks are proposed as artificial generators of high-dimensional discrete data (e.g., binary vectors, categorical data, natural language, network log files, and discrete-valued time series). Unlike deconvolution networks which generate continuous-valued data and which consist of upsampling filters and reverse pooling layers, multilayer switch networks are composed of adaptive switches which model conditional distributions of discrete random variables. An interpretable, statistical framework is introduced for training these nonlinear networks based on a maximum-likelihood objective function. To learn network parameters, stochastic gradient descent is applied to the objective. This direct optimization is stable until convergence, and does not involve back-propagation over separate encoder and decoder networks, or adversarial training of dueling networks. While training remains tractable for moderately sized networks, Markov-chain Monte Carlo (MCMC) approximations of gradients are derived for deep networks which contain latent variables. The statistical framework is evaluated on synthetic data, high-dimensional binary data of handwritten digits, and web-crawled natural language data. Aspects of the model's framework such as interpretability, computational complexity, and generalization ability are discussed.
Gaussian Process Optimization with Adaptive Sketching: Scalable and No Regret
Calandriello, Daniele, Carratino, Luigi, Lazaric, Alessandro, Valko, Michal, Rosasco, Lorenzo
Gaussian processes (GP) are a popular Bayesian approach for the optimization of black-box functions. Despite their effectiveness in simple problems, GP-based algorithms hardly scale to complex high-dimensional functions, as their per-iteration time and space cost is at least quadratic in the number of dimensions $d$ and iterations $t$. Given a set of $A$ alternative to choose from, the overall runtime $O(t^3A)$ quickly becomes prohibitive. In this paper, we introduce BKB (budgeted kernelized bandit), a novel approximate GP algorithm for optimization under bandit feedback that achieves near-optimal regret (and hence near-optimal convergence rate) with near-constant per-iteration complexity and no assumption on the input space or covariance of the GP. Combining a kernelized linear bandit algorithm (GP-UCB) with randomized matrix sketching technique (i.e., leverage score sampling), we prove that selecting inducing points based on their posterior variance gives an accurate low-rank approximation of the GP, preserving variance estimates and confidence intervals. As a consequence, BKB does not suffer from variance starvation, an important problem faced by many previous sparse GP approximations. Moreover, we show that our procedure selects at most $\tilde{O}(d_{eff})$ points, where $d_{eff}$ is the effective dimension of the explored space, which is typically much smaller than both $d$ and $t$. This greatly reduces the dimensionality of the problem, thus leading to a $O(TAd_{eff}^2)$ runtime and $O(A d_{eff})$ space complexity.
A Multi-armed Bandit MCMC, with applications in sampling from doubly intractable posterior
Markov chain Monte Carlo (MCMC) algorithms are widely used to sample from complicated distributions, especially to sample from the posterior distribution in Bayesian inference. However, MCMC is not directly applicable when facing the doubly intractable problem. In this paper, we discussed and compared two existing solutions -- Pseudo-marginal Monte Carlo and Exchange Algorithm. This paper also proposes a novel algorithm: Multi-armed Bandit MCMC (MABMC), which chooses between two (or more) randomized acceptance ratios in each step. MABMC could be applied directly to incorporate Pseudo-marginal Monte Carlo and Exchange algorithm, with higher average acceptance probability.
Transmission Matrix Inference via Pseudolikelihood Decimation
One of the biggest challenges in the field of biomedical imaging is the comprehension and the exploitation of the photon scattering through disordered media. Many studies have pursued the solution to this puzzle, achieving light-focusing control or reconstructing images in complex media. In the present work, we investigate how statistical inference helps the calculation of the transmission matrix in a complex scrambling environment, enabling its usage like a normal optical element. We convert a linear input-output transmission problem into a statistical formulation based on pseudolikelihood maximization, learning the coupling matrix via random sampling of intensity realizations. Our aim is to uncover insights from the scattering problem, encouraging the development of novel imaging techniques for better medical investigations, borrowing a number of statistical tools from spin-glass theory.
GASC: Genre-Aware Semantic Change for Ancient Greek
Perrone, Valerio, Palma, Marco, Hengchen, Simon, Vatri, Alessandro, Smith, Jim Q., McGillivray, Barbara
Word meaning changes over time, depending on linguistic and extra-linguistic factors. Associating a word's correct meaning in its historical context is a critical challenge in diachronic research, and is relevant to a range of NLP tasks, including information retrieval and semantic search in historical texts. Bayesian models for semantic change have emerged as a powerful tool to address this challenge, providing explicit and interpretable representations of semantic change phenomena. However, while corpora typically come with rich metadata, existing models are limited by their inability to exploit contextual information (such as text genre) beyond the document time-stamp. This is particularly critical in the case of ancient languages, where lack of data and long diachronic span make it harder to draw a clear distinction between polysemy and semantic change, and current systems perform poorly on these languages. We develop GASC, a dynamic semantic change model that leverages categorical metadata about the texts' genre information to boost inference and uncover the evolution of meanings in Ancient Greek corpora. In a new evaluation framework, we show that our model achieves improved predictive performance compared to the state of the art.
Financial Applications of Gaussian Processes and Bayesian Optimization
Gonzalvez, Joan, Lezmi, Edmond, Roncalli, Thierry, Xu, Jiali
In the last five years, the financial industry has been impacted by the emergence of digitalization and machine learning. In this article, we explore two methods that have undergone rapid development in recent years: Gaussian processes and Bayesian optimization. Gaussian processes can be seen as a generalization of Gaussian random vectors and are associated with the development of kernel methods. Bayesian optimization is an approach for performing derivative-free global optimization in a small dimension, and uses Gaussian processes to locate the global maximum of a black-box function. The first part of the article reviews these two tools and shows how they are connected. In particular, we focus on the Gaussian process regression, which is the core of Bayesian machine learning, and the issue of hyperparameter selection. The second part is dedicated to two financial applications. We first consider the modeling of the term structure of interest rates. More precisely, we test the fitting method and compare the GP prediction and the random walk model. The second application is the construction of trend-following strategies, in particular the online estimation of trend and covariance windows.
Elements of Sequential Monte Carlo
Naesseth, Christian A., Lindsten, Fredrik, Schön, Thomas B.
A core problem in statistics and probabilistic machine learning is to compute probability distributions and expectations. This is the fundamental problem of Bayesian statistics and machine learning, which frames all inference as expectations with respect to the posterior distribution. The key challenge is to approximate these intractable expectations. In this tutorial, we review sequential Monte Carlo (SMC), a random-sampling-based class of methods for approximate inference. First, we explain the basics of SMC, discuss practical issues, and review theoretical results. We then examine two of the main user design choices: the proposal distributions and the so called intermediate target distributions. We review recent results on how variational inference and amortization can be used to learn efficient proposals and target distributions. Next, we discuss the SMC estimate of the normalizing constant, how this can be used for pseudo-marginal inference and inference evaluation. Throughout the tutorial we illustrate the use of SMC on various models commonly used in machine learning, such as stochastic recurrent neural networks, probabilistic graphical models, and probabilistic programs.
Understanding Agent Incentives using Causal Influence Diagrams. Part I: Single Action Settings
Everitt, Tom, Ortega, Pedro A., Barnes, Elizabeth, Legg, Shane
Agents are systems that optimize an objective function in an environment. Together, the goal and the environment induce secondary objectives, incentives. Modeling the agent-environment interaction in graphical models called influence diagrams, we can answer two fundamental questions about an agent's incentives directly from the graph: (1) which nodes is the agent incentivized to observe, and (2) which nodes is the agent incentivized to influence? The answers tell us which information and influence points need extra protection. For example, we may want a classifier for job applications to not use the ethnicity of the candidate, and a reinforcement learning agent not to take direct control of its reward mechanism. Different algorithms and training paradigms can lead to different influence diagrams, so our method can be used to identify algorithms with problematic incentives and help in designing algorithms with better incentives.