Online learning in bandits with predicted context

Yongyi Guo, Ziping Xu, Susan Murphy

arXiv.org Machine Learning 

Contextual bandits (Auer, 2002; Langford and Zhang, 2007) represent a classical sequential decision-making problem in which an agent aims to maximize cumulative reward based on context information. At each round t, the agent observes a context and must choose one of K available actions based on both the current context and previous observations. Once the agent selects an action, she observes the associated reward, which is then used to refine future decision-making. Contextual bandits are a prototypical reinforcement learning problem in which a balance between exploring new actions and exploiting previously acquired information is necessary to achieve optimal long-term rewards. They have numerous real-world applications, including personalized recommendation systems (Li et al., 2010; Bouneffouf et al., 2012), healthcare (Yom-Tov et al., 2017; Liao et al., 2020), and online education (Liu et al., 2014; Shaikh et al., 2019). Despite the extensive literature on contextual bandits, in many real-world applications the agent never observes the context exactly.
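The interaction protocol described above can be made concrete with a short simulation. The following is a minimal sketch, assuming a simple epsilon-greedy strategy with per-action least-squares reward estimates; the strategy, variable names, and numerical settings are illustrative assumptions, not the method studied in this paper.

```python
# A minimal sketch of the contextual bandit loop: observe context, pick one of
# K actions, observe reward, update the model. Epsilon-greedy with per-action
# ridge-regularized least squares is an assumed example strategy.
import numpy as np

rng = np.random.default_rng(0)
d, K, T, eps = 5, 3, 2000, 0.1          # context dim, actions, rounds, exploration rate
theta_true = rng.normal(size=(K, d))    # unknown per-action parameters (simulation only)

# Sufficient statistics for least squares, one set per action.
A = np.stack([np.eye(d) for _ in range(K)])   # Gram matrices (ridge-initialized)
b = np.zeros((K, d))                          # response vectors

cum_reward = 0.0
for t in range(T):
    x = rng.normal(size=d)                    # agent observes a context
    if rng.random() < eps:
        a = int(rng.integers(K))              # explore: random action
    else:
        theta_hat = np.array([np.linalg.solve(A[k], b[k]) for k in range(K)])
        a = int(np.argmax(theta_hat @ x))     # exploit: best estimated reward
    r = theta_true[a] @ x + rng.normal(scale=0.1)  # observe noisy reward
    A[a] += np.outer(x, x)                    # refine estimates with new observation
    b[a] += r * x
    cum_reward += r

print(f"average reward over {T} rounds: {cum_reward / T:.3f}")
```

Note that the sketch assumes the context x is observed exactly at each round; the setting motivating this paper is precisely the one where only a prediction of x is available.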
