Collaborative Min-Max Regret in Grouped Multi-Armed Bandits

Moïse Blanchard, Vineet Goyal

arXiv.org Machine Learning 

This model has wide-ranging applications, including recommender systems that sequentially offer product recommendations to users to maximize long-term revenue [LCLS10], and clinical trials in which the goal is to find the best treatment for a sequence of patients [Tho33]. A key aspect of multi-armed bandits is that to maximize long-term reward, one must balance exploration (acquiring more information about suboptimal actions to potentially improve future decisions) against exploitation (following the best actions given the current information). Exploration naturally comes at a cost for users, and this cost can disproportionately impact certain groups of users. This raises important questions about when and how the exploration burden can be alleviated for these groups [RSWW18, JKL20]. Notably, [BF24] showed that in the asymptotic regime, the exploration cost can be shared in an arbitrarily unfair manner between groups under standard learning policies, and proposed a Nash-bargaining solution to alleviate this issue. As an equivalent perspective on the problem, we can consider a setting in which several agents face their own multi-armed bandit problem, e.g., groups of recommenders targeting different populations. The goal is then to understand when and how collaborative exploration can be beneficial, as opposed to each group solving its problem individually without sharing information. We focus on heterogeneity between agents or groups in terms of their set of available actions, which corresponds to the so-called grouped bandit setting [BF24].
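To make the exploration/exploitation trade-off concrete, here is a minimal single-agent sketch using the classical UCB1 index policy on Bernoulli arms. This is illustrative background only, not the paper's grouped or min-max-regret method; the arm means, horizon, and function name are assumptions for the example. The per-round regret accumulated while pulling suboptimal arms is exactly the "exploration cost" discussed above.

```python
import math
import random

def ucb1_regret(means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms with the given true means.

    Returns the cumulative (pseudo-)regret relative to always
    pulling the best arm. Illustrative sketch only; the paper
    studies a grouped, multi-agent setting, not this one.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k       # number of pulls of each arm
    sums = [0.0] * k       # total observed reward of each arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1    # initialization: pull each arm once
        else:
            # empirical mean (exploitation) + confidence bonus (exploration)
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret
```

On a two-armed instance with a large gap, the exploration bonus shrinks as `counts[i]` grows, so suboptimal pulls become rare and the regret grows only logarithmically in the horizon rather than linearly.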
