Contextual Thompson Sampling via Generation of Missing Data

Zhang, Kelly W., Cai, Tiffany Tianhui, Namkoong, Hongseok, Russo, Daniel

arXiv.org Machine Learning 

Recent advances in machine learning have transformed our ability to develop high-quality predictive and generative models for complex data. This work introduces a framework for developing decision-making algorithms, specifically for contextual bandit problems, that can take advantage of these machine learning advances. By design, we assume the algorithm developer is able to effectively apply these machine learning techniques (e.g., minimize a loss via gradient descent) and employ these methods as subroutines in our decision-making algorithm. Moreover, our theory formally connects the quality of effective (self-)supervised learning via loss minimization to the quality of decision-making.

Classically, contextual Thompson sampling algorithms form a parametric model of the environment and treat the decision-maker's uncertainty as arising from unknown latent parameters of that model [51]. In this classical perspective, the primitive operations that are assumed to be feasible (at least approximately) include i) the ability to specify an informative prior for the latent parameter using domain knowledge, ii) the ability to sample from the posterior distribution of the latent parameter, and iii) the ability to update the posterior distribution as more data is collected. Unfortunately, it is well known that all three of these primitive operations are non-trivial to perform with neural networks [20, 61]. Building on our previous work [8], which focuses on multi-armed bandits without contexts, we instead view missing, but potentially observable, future outcomes as the source of the decision-maker's uncertainty.
This perspective allows us to replace the primitive operations required in the classical view with new primitives that are more compatible with neural networks: i) the ability to effectively minimize an offline sequence prediction loss, ii) the ability to autoregressively generate from the optimized sequence model, and iii) the ability to fit a desired policy given access to a complete dataset (outcomes for all actions and decision times).
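To make the missing-data view concrete, the sketch below shows a minimal Thompson sampling loop in which, at each round, the missing outcomes for every action are imputed by sampling from a generative model, and the policy is then fit (here, a greedy argmax) on the completed data. This is a toy illustration, not the paper's method: the `impute_outcomes` function, the Bayesian linear model that stands in for the learned autoregressive sequence model, and the synthetic environment (`true_theta`, Gaussian noise) are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ACTIONS, DIM, ROUNDS = 3, 2, 200
# Hypothetical environment: one linear reward function per action.
true_theta = rng.normal(size=(NUM_ACTIONS, DIM))

def impute_outcomes(context, history):
    """Stand-in for primitive (ii): autoregressive generation from a fitted
    sequence model. Here a Bayesian linear model (unit prior and noise)
    plays that role: one posterior draw per action yields an imputed
    outcome for every action at the current context."""
    sampled = np.empty(NUM_ACTIONS)
    for a in range(NUM_ACTIONS):
        X = np.array([c for c, act, _ in history if act == a])
        y = np.array([r for _, act, r in history if act == a])
        prec = np.eye(DIM) + X.T @ X               # posterior precision
        mean = np.linalg.solve(prec, X.T @ y)      # posterior mean
        theta = rng.multivariate_normal(mean, np.linalg.inv(prec))
        sampled[a] = context @ theta               # imputed missing outcome
    return sampled

# Warm start: observe each action once so every model is well defined.
history = []
for a in range(NUM_ACTIONS):
    c = rng.normal(size=DIM)
    history.append((c, a, c @ true_theta[a] + 0.1 * rng.normal()))

rewards = []
for _ in range(ROUNDS):
    c = rng.normal(size=DIM)
    # Primitive (iii): given a "complete" (imputed) outcome table, the
    # fitted policy reduces to a greedy argmax over actions.
    a = int(np.argmax(impute_outcomes(c, history)))
    r = c @ true_theta[a] + 0.1 * rng.normal()     # observe chosen outcome
    history.append((c, a, r))
    rewards.append(r)
```

Primitive (i), fitting the sequence model by minimizing an offline prediction loss, is replaced here by the closed-form Bayesian linear update inside `impute_outcomes`; with a neural sequence model that line would become a gradient-based training step.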