Online learning in bandits with predicted context
Yongyi Guo, Ziping Xu, Susan Murphy
Contextual bandits (Auer, 2002; Langford and Zhang, 2007) represent a classical sequential decision-making problem where an agent aims to maximize cumulative reward based on context information. At each round t, the agent observes a context and must choose one of K available actions based on both the current context and previous observations. Once the agent selects an action, she observes the associated reward, which is then used to refine future decision-making. Contextual bandits are typical examples of reinforcement learning problems where a balance between exploring new actions and exploiting previously acquired information is necessary to achieve optimal long-term rewards. They have numerous real-world applications, including personalized recommendation systems (Li et al., 2010; Bouneffouf et al., 2012), healthcare (Yom-Tov et al., 2017; Liao et al., 2020), and online education (Liu et al., 2014; Shaikh et al., 2019). Despite the extensive existing literature on contextual bandits, in many real-world applications the agent never observes the context exactly.
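The interaction protocol described above can be sketched as follows. This is a minimal illustrative simulation, not the algorithm proposed in the paper: it assumes K actions with unknown linear reward functions over a d-dimensional context, and uses a simple epsilon-greedy agent with per-action ridge least-squares estimates. All names and parameter values (K, d, T, the noise level, eps) are hypothetical choices for the sketch; note that the agent here observes the context exactly, whereas the paper studies the harder setting where only a prediction of the context is available.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (not the paper's method): K actions, each with an
# unknown linear reward function over a d-dimensional context.
K, d, T = 3, 5, 2000
theta_true = rng.normal(size=(K, d))        # hidden reward parameters

def pull(action, context):
    """Observed reward = linear mean plus Gaussian noise."""
    return theta_true[action] @ context + 0.1 * rng.normal()

# Epsilon-greedy agent with per-action ridge least-squares estimates.
A = np.stack([np.eye(d) for _ in range(K)])  # regularized Gram matrices
b = np.zeros((K, d))
eps = 0.05
cum_reward = 0.0

for t in range(T):
    context = rng.normal(size=d)             # here the exact context is seen
    theta_hat = np.array([np.linalg.solve(A[a], b[a]) for a in range(K)])
    if rng.random() < eps:                   # explore: random action
        action = int(rng.integers(K))
    else:                                    # exploit: estimated best action
        action = int(np.argmax(theta_hat @ context))
    r = pull(action, context)                # observe reward for chosen action
    A[action] += np.outer(context, context)  # update that action's estimate
    b[action] += r * context
    cum_reward += r
```

The loop makes the explore/exploit trade-off from the paragraph concrete: with probability eps the agent tries a random action to keep improving its estimates, and otherwise it exploits the action whose estimated mean reward is highest for the current context.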
Oct-31-2023