We study an online decision making problem where on each round a learner chooses a list of items based on some side information, receives a scalar feedback value for each individual item, and a reward that is linearly related to this feedback. These problems, known as contextual semibandits, arise in crowdsourcing, recommendation, and many other domains. This paper reduces contextual semibandits to supervised learning, allowing us to leverage powerful supervised learning methods in this partial-feedback setting. Our first reduction applies when the mapping from feedback to reward is known and leads to a computationally efficient algorithm with near-optimal regret. We show that this algorithm outperforms state-of-the-art approaches on real-world learning-to-rank datasets, demonstrating the advantage of oracle-based algorithms. Our second reduction applies to the previously unstudied setting when the linear mapping from feedback to reward is unknown. Our regret guarantees are superior to prior techniques that ignore the feedback.
Wang, Yingfei (Princeton University) | Ouyang, Hua (Apple Inc.) | Wang, Chu (Nokia Bell Labs) | Chen, Jianhui (Yahoo Research) | Asamov, Tsvetan (Princeton University) | Chang, Yi (Huawei Research America)
Multi-Armed Bandit (MAB) framework has been successfully applied in many web applications. However, many complex real-world applications that involve multiple content recommendations cannot fit into the traditional MAB setting. To address this issue, we consider an ordered combinatorial semi-bandit problem where the learner recommends S actions from a base set of K actions, and displays the results in S (out of M ) different positions. The aim is to maximize the cumulative reward with respect to the best possible subset and positions in hindsight. By the adaptation of a minimum-cost maximum-flow network, a practical algorithm based on Thompson sampling is derived for the (contextual) combinatorial problem, thus resolving the problem of computational intractability.With its potential to work with whole-page recommendation and any probabilistic models, to illustrate the effectiveness of our method, we focus on Gaussian process optimization and a contextual setting where click-through rate is predicted using logistic regression. We demonstrate the algorithms’ performance on synthetic Gaussian process problems and on large-scale news article recommendation datasets from Yahoo! Front Page Today Module.
Multi-armed bandits a simple but very powerful framework for algorithms that make decisions over time under uncertainty. An enormous body of work has accumulated over the years, covered in several books and surveys. This book provides a more introductory, textbook-like treatment of the subject. Each chapter tackles a particular line of work, providing a self-contained, teachable technical introduction and a review of the more advanced results. The chapters are as follows: Stochastic bandits; Lower bounds; Bayesian Bandits and Thompson Sampling; Lipschitz Bandits; Full Feedback and Adversarial Costs; Adversarial Bandits; Linear Costs and Semi-bandits; Contextual Bandits; Bandits and Zero-Sum Games; Bandits with Knapsacks; Incentivized Exploration and Connections to Mechanism Design.
Contextual bandit algorithms are sensitive to the estimation method of the outcome model as well as the exploration method used, particularly in the presence of rich heterogeneity or complex outcome models, which can lead to difficult estimation problems along the path of learning. We develop algorithms for contextual bandits with linear payoffs that integrate balancing methods from the causal inference literature in their estimation to make it less prone to problems of estimation bias. We provide the first regret bound analyses for linear contextual bandits with balancing and show that our algorithms match the state of the art theoretical guarantees. We demonstrate the strong practical advantage of balanced contextual bandits on a large number of supervised learning datasets and on a synthetic example that simulates model misspecification and prejudice in the initial training data.
This paper studies the evaluation of policies that recommend an ordered set of items (e.g., a ranking) based on some context---a common scenario in web search, ads, and recommendation. We build on techniques from combinatorial bandits to introduce a new practical estimator that uses logged data to estimate a policy's performance. A thorough empirical evaluation on real-world data reveals that our estimator is accurate in a variety of settings, including as a subroutine in a learning-to-rank task, where it achieves competitive performance. We derive conditions under which our estimator is unbiased---these conditions are weaker than prior heuristics for slate evaluation---and experimentally demonstrate a smaller bias than parametric approaches, even when these conditions are violated. Finally, our theory and experiments also show exponential savings in the amount of required data compared with general unbiased estimators.