Exploring Offline Policy Evaluation for the Continuous-Armed Bandit Problem

Jules Kruijswijk, Petri Parvinen, Maurits Kaptein

arXiv.org Machine Learning 

In the canonical multi-armed bandit (MAB) problem, a gambler stands in front of a row of slot machines, each with a (potentially) different payoff. The gambler decides in sequence which machine to play and, over the course of play, aims to make as much profit as possible by simultaneously learning from previous observations and using that knowledge to steer future actions (Berry and Fristedt, 1985; Whittle, 1980). To do so, she needs a strategy that dictates which arm to play next given the previous observations. Finding such a strategy is complicated because at each interaction the gambler observes only the outcome of the machine she played; she never learns the outcomes of the other possible actions at that moment in time. This so-called omission of counterfactuals (Li, Chu, Langford, and Wang, 2011), the inability to observe all possible outcomes, gives rise to the exploration versus exploitation tradeoff (Berry and Fristedt, 1985): at each time point the gambler can either play a machine she is uncertain about in order to gain more knowledge (exploration), or use the knowledge gained in earlier interactions by playing a machine with a high expected payoff (exploitation).
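To make the tradeoff concrete, the following is a minimal sketch of one simple strategy, epsilon-greedy, on a small discrete bandit. It is an illustration only and is not a method taken from this paper: the function name `run_epsilon_greedy`, the Bernoulli (win/lose) payoffs, and the parameter values are all assumptions chosen for the example. Note that the gambler only ever updates the estimate for the arm she actually played, which is the omission of counterfactuals described above.

```python
import random


def run_epsilon_greedy(payoff_probs, n_rounds=1000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit with an epsilon-greedy strategy (illustrative sketch).

    payoff_probs: assumed win probability of each machine (hidden from the gambler).
    epsilon: probability of exploring a uniformly random machine in a given round.
    """
    rng = random.Random(seed)
    n_arms = len(payoff_probs)
    pulls = [0] * n_arms      # how often each machine was played
    means = [0.0] * n_arms    # running estimate of each machine's payoff
    total_reward = 0

    for _ in range(n_rounds):
        if rng.random() < epsilon:
            # Exploration: try a random machine to learn more about it.
            arm = rng.randrange(n_arms)
        else:
            # Exploitation: play the machine with the highest estimated payoff.
            arm = max(range(n_arms), key=lambda a: means[a])

        # Only the played machine's outcome is observed; the outcomes of the
        # other machines in this round remain unknown.
        reward = 1 if rng.random() < payoff_probs[arm] else 0

        pulls[arm] += 1
        means[arm] += (reward - means[arm]) / pulls[arm]  # incremental mean
        total_reward += reward

    return pulls, means, total_reward
```

Setting `epsilon=0` in this sketch gives pure exploitation: with all estimates starting at zero, the gambler can keep playing the first machine and never discover that another one pays better. Setting `epsilon=1` gives pure exploration, which learns about every machine but never uses that knowledge.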
