On the optimal regret of collaborative personalized linear bandits
Huang, Bruce, Zhou, Ruida, Yang, Lin F., Diggavi, Suhas
–arXiv.org Artificial Intelligence
Stochastic linear bandits are a fundamental model in sequential decision-making, where in each round, an agent selects a vector-valued action and receives a stochastic reward with expectation equal to the inner product between the action and an unknown parameter vector [14]. This setting has been extensively studied in the single-agent case, where the goal is to maximize cumulative reward over time through efficient exploration and exploitation [19, 12, 1]. In this paper, we consider a generalized multi-agent setting, motivated by many practical scenarios such as distributed decision systems [13, 24], personalized adaptive interfaces [16, 11], and multi-device learning [4]--where in each round, a set of m agents must each select an action and receive feedback from a stochastic, potentially heterogeneous reward function. A natural baseline approach is to run a standard single-agent linear bandit algorithm independently for each agent. While straightforward, this approach entirely ignores potential similarities across agents. When the agents' environments are similar, such independent learning leads to redundant exploration and a suboptimal use of data. In contrast, our goal is to develop collaborative learning algorithms that adaptively share information across agents--without knowing in advance how similar or different they are. This raises a fundamental question: How can collaboration reduce regret under unknown heterogeneity? To address the question, we model heterogeneity using a hierarchical/empirical Bayesian framework, where each agent's unknown parameter vector is drawn from an unknown population distribution.
arXiv.org Artificial Intelligence
Jun-23-2025
- Country:
- North America > United States > California (0.28)
- Genre:
- Research Report (1.00)
- Industry:
- Education (0.46)
- Technology: