Collaborative Min-Max Regret in Grouped Multi-Armed Bandits
Blanchard, Moïse, Goyal, Vineet
This model has wide-ranging applications, including recommender systems that sequentially offer product recommendations to users to maximize long-term revenue [LCLS10], and clinical trials in which the goal is to find the best treatment for a sequence of patients [Tho33]. A key aspect of multi-armed bandits is that to maximize long-term reward, one must balance exploration (acquiring more information about suboptimal actions to potentially improve future decisions) against exploitation (following the best actions given the current information). Exploration naturally comes at a cost for users, and this cost can disproportionately impact certain groups of users. This raises important questions about when and how the exploration burden can be alleviated for these groups [RSWW18, JKL20]. Notably, [BF24] showed that in the asymptotic regime, the exploration cost can be shared in an arbitrarily unfair manner between groups under standard learning policies, and proposed a Nash-bargaining solution to alleviate this issue. As an equivalent perspective on the problem, we can consider a setting in which several agents each face their own multi-armed bandit problem, e.g., groups of recommenders targeting different populations. The goal is then to understand when and how collaborative exploration can be beneficial, as opposed to each group solving its problem individually without sharing information. We focus on heterogeneity between agents or groups in terms of their sets of available actions, which corresponds to the so-called grouped bandit setting [BF24].
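The exploration-exploitation balance described above can be made concrete with the classical UCB1 index policy; this is a generic illustration, not the paper's method, and the Bernoulli arm means, horizon, and seed below are illustrative assumptions.

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Run UCB1 on a Bernoulli bandit with the given arm means.

    The index  mean_hat[a] + sqrt(2 ln t / n[a])  trades off
    exploitation (empirical mean) against exploration (bonus that
    shrinks as arm a is pulled more often). Returns pull counts.
    """
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k        # number of times each arm was pulled
    rewards = [0.0] * k    # cumulative reward per arm
    for t in range(1, horizon + 1):
        if t <= k:
            a = t - 1      # pull each arm once to initialize
        else:
            a = max(range(k),
                    key=lambda i: rewards[i] / pulls[i]
                    + math.sqrt(2 * math.log(t) / pulls[i]))
        pulls[a] += 1
        rewards[a] += 1.0 if rng.random() < means[a] else 0.0
    return pulls

# Hypothetical two-arm instance: the policy concentrates its pulls
# on the better arm (mean 0.7) while still sampling the other.
pulls = ucb1([0.3, 0.7], horizon=2000)
```

The exploration bonus term is exactly the "cost" the abstract refers to: pulls spent on the suboptimal arm are the price of learning which arm is best, and in grouped settings this price can fall unevenly across groups.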
Jun-13-2025