Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy

Oct-23-2020–arXiv.org Machine Learning

As an instance of sequential decision-making problems, multi-armed bandit (MAB) algorithms have attracted significant attention in various applications, such as ad optimization, personalized medicine, search engines, and recommendation systems. Recently, various methods have emerged for evaluating a new policy using historical data obtained via MAB algorithms (Beygelzimer & Langford, 2009; Li et al., 2010). The goal of off-policy evaluation (OPE) is to evaluate a new policy by estimating the expected reward it would obtain (Dudík et al., 2011; Wang et al., 2017; Narita et al., 2019; Bibaut et al., 2019; Kallus & Uehara, 2019; Oberst & Sontag, 2019). Most existing OPE studies presume that the samples are independent and identically distributed (i.i.d.). However, an MAB algorithm updates the probability of choosing an action based on past observations, so the samples are not i.i.d.
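
To make the setting concrete, here is a minimal sketch of how bandit logs collected under a batch-updated policy might be fed to an off-policy estimator. It uses the standard inverse probability weighting (IPW) estimator, not the estimator proposed in the paper, and the epsilon-greedy logging policy, the Bernoulli rewards, and the names `ipw_value` and `simulate_batched_logs` are all assumptions made for illustration. The logging probabilities change from batch to batch as a function of earlier rewards, which is the dependence between samples that the abstract describes.

```python
import random


def ipw_value(logs, eval_policy):
    """IPW estimate of the expected reward of `eval_policy`.

    logs: list of (action, reward, logging_prob) tuples, where logging_prob
          is the probability the logging policy assigned to `action` when
          it was chosen.
    eval_policy: list of action probabilities under the policy to evaluate.
    """
    total = 0.0
    for action, reward, logging_prob in logs:
        # Reweight each logged reward by the ratio of evaluation to logging probability.
        total += eval_policy[action] / logging_prob * reward
    return total / len(logs)


def simulate_batched_logs(true_means, n_batches, batch_size, epsilon, rng):
    """Hypothetical logger: epsilon-greedy, updated only between batches."""
    n_arms = len(true_means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    probs = [1.0 / n_arms] * n_arms  # first batch explores uniformly
    logs = []
    for _ in range(n_batches):
        for _ in range(batch_size):
            action = rng.choices(range(n_arms), weights=probs)[0]
            reward = 1.0 if rng.random() < true_means[action] else 0.0
            logs.append((action, reward, probs[action]))
            counts[action] += 1
            sums[action] += reward
        # Batch update: the next batch's probabilities depend on all past
        # observations, so the logged samples are not i.i.d.
        means = [sums[a] / counts[a] if counts[a] else 0.0 for a in range(n_arms)]
        best = max(range(n_arms), key=lambda a: means[a])
        probs = [
            epsilon / n_arms + (1.0 - epsilon) * (1.0 if a == best else 0.0)
            for a in range(n_arms)
        ]
    return logs


rng = random.Random(0)
true_means = [0.3, 0.5, 0.7]   # assumed Bernoulli reward means per arm
logs = simulate_batched_logs(true_means, n_batches=20, batch_size=500,
                             epsilon=0.2, rng=rng)
eval_policy = [0.0, 0.0, 1.0]  # evaluate "always play arm 2"
estimate = ipw_value(logs, eval_policy)
print(f"IPW estimate: {estimate:.3f} (true value under these assumptions: 0.7)")
```

Keeping `epsilon` strictly positive ensures every logged probability is nonzero, so the importance weights stay finite. Because the data are dependent, any confidence interval built around such an estimate would need an argument that does not rely on the i.i.d. assumption.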

artificial intelligence, data mining, machine learning

