Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy

Masahiro Kato, Yusuke Kaneko

arXiv.org Machine Learning 

Multi-armed bandit (MAB) algorithms, an instance of sequential decision-making, have attracted significant attention in applications such as ad optimization, personalized medicine, search engines, and recommendation systems. Recently, various methods have emerged for evaluating a new policy using historical data obtained via MAB algorithms (Beygelzimer & Langford, 2009; Li et al., 2010). The goal of off-policy evaluation (OPE) is to evaluate a new policy by estimating the expected reward it would obtain (Dudík et al., 2011; Wang et al., 2017; Narita et al., 2019; Bibaut et al., 2019; Kallus & Uehara, 2019; Oberst & Sontag, 2019). Most existing OPE studies presume that the samples are independent and identically distributed (i.i.d.). However, an MAB algorithm updates the probability of choosing each action based on past observations, so the samples it generates are not i.i.d.
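To make the OPE goal concrete, the following is a minimal sketch of a standard inverse probability weighting (IPW) estimator, which is not necessarily the estimator proposed in this paper. The function names, the two-armed setup, and the reward means are all hypothetical. The simulated logging policy is deliberately fixed, which is the i.i.d. setting that most existing studies assume; a bandit algorithm that updates its action probabilities over time would violate that assumption.

```python
import random


def ipw_value(logs, eval_policy):
    """IPW estimate of the expected reward of eval_policy.

    logs: list of (action, reward, behavior_prob) tuples, where
          behavior_prob is the probability with which the logging
          policy chose that action.
    eval_policy: dict mapping each action to its probability under
          the new policy being evaluated.
    """
    total = 0.0
    for action, reward, behavior_prob in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the new policy is to choose the logged action.
        total += eval_policy[action] / behavior_prob * reward
    return total / len(logs)


def simulate_logs(n, behavior_policy, mean_reward, seed=0):
    """Draw n samples from a FIXED logging policy (hypothetical setup).

    Because the policy never changes, the samples are i.i.d.
    """
    rng = random.Random(seed)
    actions = list(behavior_policy)
    weights = [behavior_policy[a] for a in actions]
    logs = []
    for _ in range(n):
        action = rng.choices(actions, weights=weights)[0]
        # Bernoulli reward with an arm-specific mean.
        reward = 1.0 if rng.random() < mean_reward[action] else 0.0
        logs.append((action, reward, behavior_policy[action]))
    return logs


# Hypothetical two-armed example.
behavior = {"a": 0.5, "b": 0.5}   # logging policy
target = {"a": 0.1, "b": 0.9}     # new policy to evaluate
means = {"a": 0.2, "b": 0.6}      # true mean rewards (unknown in practice)

logs = simulate_logs(20000, behavior, means)
estimate = ipw_value(logs, target)
true_value = sum(target[a] * means[a] for a in target)
print(f"IPW estimate: {estimate:.3f}, true value: {true_value:.3f}")
```

In this fixed-policy simulation the estimate should land close to the true value. When the logged data instead come from an adaptive bandit algorithm, the logged probabilities depend on past observations, and the usual i.i.d. analysis of such an estimator no longer applies directly.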
