Multitask Bandit Learning through Heterogeneous Feedback Aggregation

Wang, Zhi, Zhang, Chicheng, Singh, Manish Kumar, Riek, Laurel D., Chaudhuri, Kamalika

arXiv.org Machine Learning 

Online multi-armed bandit learning has many important real-world applications (see Villar et al., 2015; Shen et al., 2015; Li et al., 2010, for a few examples). In practice, a group of online bandit learning agents is often deployed for similar tasks, and they learn to perform these tasks in similar yet nonidentical environments. For example, a group of assistive healthcare robots may be deployed to provide personalized cognitive training to people with dementia (PwD), e.g., by playing cognitive training games with them (Kubota et al., 2020). Each robot seeks to learn the preferences of its paired PwD so as to recommend tailored health interventions based on how the PwD reacts to and engages with the activities (as captured by sensors on the robots) (Kubota et al., 2020). As PwD may have similar preferences and may therefore exhibit similar reactions, a natural question arises--can the robots, as a multi-agent system, learn to perform their respective tasks faster through collaboration? In this paper, we develop multi-agent bandit learning algorithms in which each agent can robustly aggregate data from other agents to better perform its respective task. We generalize the multi-armed bandit problem (Auer et al., 2002) and formulate the ɛ-Multi-Player Multi-Armed Bandit (ɛ-MPMAB) problem, which models heterogeneous multitask learning in a multi-agent bandit learning setting. In an ɛ-MPMAB problem instance, a set of M players are deployed to perform similar tasks: they simultaneously interact with a set of actions/arms, and for each arm, different players receive feedback from similar but not necessarily identical reward distributions. In the above assistive robotics example, each player corresponds to a robot; each arm corresponds to one of the cognitive activities to choose from; for each player and each arm, there is a separate reward distribution which reflects a PwD's
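To make the ɛ-MPMAB setup concrete, the following is a minimal sketch of such an environment, assuming Bernoulli rewards and interpreting ɛ as a bound on how far any two players' mean rewards for the same arm may differ. The class and method names (`EpsilonMPMAB`, `pull`) are illustrative, not from the paper.

```python
import numpy as np

class EpsilonMPMAB:
    """Sketch of an epsilon-MPMAB instance with Bernoulli rewards."""

    def __init__(self, base_means, epsilon, num_players, seed=0):
        self.rng = np.random.default_rng(seed)
        self.epsilon = epsilon
        # Each player's mean for each arm deviates from a shared base mean
        # by at most epsilon/2, so any two players' means for the same arm
        # differ by at most epsilon (clipping to [0, 1] only shrinks gaps).
        noise = self.rng.uniform(
            -epsilon / 2, epsilon / 2, size=(num_players, len(base_means))
        )
        self.means = np.clip(np.asarray(base_means) + noise, 0.0, 1.0)

    def pull(self, player, arm):
        # One round of feedback: a Bernoulli reward drawn from the
        # (player, arm)-specific distribution.
        return float(self.rng.random() < self.means[player, arm])

# Example: 4 players (robots), 3 arms (cognitive activities), dissimilarity 0.1.
env = EpsilonMPMAB(base_means=[0.2, 0.5, 0.8], epsilon=0.1, num_players=4)
reward = env.pull(player=0, arm=2)
```

A learning algorithm for player i could then decide, per arm, whether to aggregate other players' samples (cheap but biased by up to ɛ) or rely only on its own (unbiased but scarce).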
