Exploring Offline Policy Evaluation for the Continuous-Armed Bandit Problem

Kruijswijk, Jules; Parvinen, Petri; Kaptein, Maurits

Aug-21-2019–arXiv.org Machine Learning

In the canonical multi-armed bandit (MAB) problem, a gambler stands in front of a row of slot machines, each with a (potentially) different payoff. The gambler decides in sequence which machine to play and, over the course of play, aims to make as much profit as possible by simultaneously learning from previous observations and using the gained knowledge to steer future actions (Berry and Fristedt, 1985; Whittle, 1980). She therefore needs a strategy that dictates which arm to play next given the previous observations. Finding such a strategy is complicated because at each interaction the gambler observes only the outcome of the machine she played; she never learns the outcomes of the other possible courses of action at that moment in time. This so-called omission of counterfactuals (Li, Chu, Langford, and Wang, 2011), that is, not being able to gain knowledge about all possible outcomes, gives rise to the exploration versus exploitation tradeoff (Berry and Fristedt, 1985): at each time point the gambler's action can be geared either toward gaining more knowledge about the machines she is uncertain about (exploration) or toward using the knowledge gained in earlier interactions by playing machines with a high expected payoff (exploitation).
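
The tradeoff described above can be made concrete with a small simulation. The sketch below uses an epsilon-greedy strategy, which is a common textbook heuristic chosen here purely for illustration and is not taken from the paper; the function name `run_epsilon_greedy`, the Bernoulli payoff probabilities, and the values of `epsilon` and `n_rounds` are all hypothetical. Note that in each round only the reward of the played machine is observed, which mirrors the omission of counterfactuals.

```python
import random


def run_epsilon_greedy(payoff_probs, n_rounds=1000, epsilon=0.1, seed=0):
    """Play Bernoulli slot machines with an epsilon-greedy strategy.

    payoff_probs holds the (hypothetical) win probability of each machine;
    the gambler does not know these values and must estimate them.
    """
    rng = random.Random(seed)
    n_arms = len(payoff_probs)
    pulls = [0] * n_arms        # how often each machine was played
    estimates = [0.0] * n_arms  # running mean payoff per machine
    total_reward = 0

    for _ in range(n_rounds):
        if rng.random() < epsilon:
            # Exploration: play a random machine to learn more about it.
            arm = rng.randrange(n_arms)
        else:
            # Exploitation: play the machine with the best estimate so far.
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # Only the played machine's outcome is observed; what the other
        # machines would have paid in this round is never revealed.
        reward = 1 if rng.random() < payoff_probs[arm] else 0

        pulls[arm] += 1
        # Incremental update of the running mean for the played machine.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
        total_reward += reward

    return pulls, estimates, total_reward


if __name__ == "__main__":
    pulls, estimates, total = run_epsilon_greedy([0.2, 0.5, 0.7])
    print("pulls per machine:", pulls)
    print("estimated payoffs:", [round(e, 3) for e in estimates])
    print("total reward:", total)
```

A larger `epsilon` spends more rounds on exploration, while a smaller one commits earlier to the machine that currently looks best; neither choice is claimed to be optimal here.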

artificial intelligence, data mining, machine learning

Genre:
- Research Report > Experimental Study (0.46)

Technology:
- Information Technology
  - Data Science > Data Mining
    - Big Data (1.00)
  - Artificial Intelligence
    - Representation & Reasoning (1.00)
    - Machine Learning (1.00)
