Dumitrascu, Bianca, Feng, Karen, Engelhardt, Barbara E

We address the problem of regret minimization in logistic contextual bandits, where a learner decides among sequential actions or arms given their respective contexts to maximize binary rewards. Using a fast inference procedure with Polya-Gamma distributed augmentation variables, we propose an improved version of Thompson Sampling, a Bayesian formulation of contextual bandits with near-optimal performance. Our approach, Polya-Gamma augmented Thompson Sampling (PG-TS), achieves state-of-the-art performance on simulated and real data. PG-TS explores the action space efficiently and exploits high-reward arms, quickly converging to solutions of low regret. Its explicit estimation of the posterior distribution of the context feature covariance leads to substantial empirical gains over approximate approaches. PG-TS is the first approach to demonstrate the benefits of Polya-Gamma augmentation in bandits and to propose an efficient Gibbs sampler for approximating the analytically unsolvable integral of logistic contextual bandits.

Dumitrascu, Bianca, Feng, Karen, Engelhardt, Barbara

We address the problem of regret minimization in logistic contextual bandits, where a learner decides among sequential actions or arms given their respective contexts to maximize binary rewards. Using a fast inference procedure with Pólya-Gamma distributed augmentation variables, we propose an improved version of Thompson Sampling, a Bayesian formulation of contextual bandits with near-optimal performance. Our approach, Pólya-Gamma augmented Thompson Sampling (PG-TS), achieves state-of-the-art performance on simulated and real data. PG-TS explores the action space efficiently and exploits high-reward arms, quickly converging to solutions of low regret. Its explicit estimation of the posterior distribution of the context feature covariance leads to substantial empirical gains over approximate approaches. PG-TS is the first approach to demonstrate the benefits of Pólya-Gamma augmentation in bandits and to propose an efficient Gibbs sampler for approximating the analytically unsolvable integral of logistic contextual bandits.

Dumitrascu, Bianca, Feng, Karen, Engelhardt, Barbara

Colin, Igor, Thomas, Albert, Draief, Moez

Abstract--As cellular networks become denser, a scalable and dynamic tuning of wireless base station parameters can only be achieved through automated optimization. Although the contextual banditframework arises as a natural candidate for such a task, its extension to a parallel setting is not straightforward: one needs to carefully adapt existing methods to fully leverage the multi-agent structure of this problem. We propose two approaches: one derived from a deterministic UCB-like method and the other relying on Thompson sampling. Thanks to its bayesian nature, the latter is intuited to better preserve the exploration-exploitation balance in the bandit batch. This is verified on toy experiments, where Thompson sampling shows robustness to the variability of the contexts. Finally, we apply both methods on a real base station network dataset and evidence that Thompson sampling outperforms both manual tuning and contextual UCB. I. INTRODUCTION The land area covered by a cellular wireless network, such as a mobile phone network, is divided into small areas called cells, each cell being covered by the antenna of a fixed base station (see Figure 1).

Krishnamurthy, Akshay, Agarwal, Alekh, Dudik, Miro

We study an online decision making problem where on each round a learner chooses a list of items based on some side information, receives a scalar feedback value for each individual item, and a reward that is linearly related to this feedback. These problems, known as contextual semibandits, arise in crowdsourcing, recommendation, and many other domains. This paper reduces contextual semibandits to supervised learning, allowing us to leverage powerful supervised learning methods in this partial-feedback setting. Our first reduction applies when the mapping from feedback to reward is known and leads to a computationally efficient algorithm with near-optimal regret. We show that this algorithm outperforms state-of-the-art approaches on real-world learning-to-rank datasets, demonstrating the advantage of oracle-based algorithms. Our second reduction applies to the previously unstudied setting when the linear mapping from feedback to reward is unknown. Our regret guarantees are superior to prior techniques that ignore the feedback.