Goto

Collaborating Authors

 multiplayer bandit


Queue Up Your Regrets: Achieving the Dynamic Capacity Region of Multiplayer Bandits

Neural Information Processing Systems

Consider $N$ cooperative agents such that for $T$ turns, each agent n takes an action $a_{n}$ and receives a stochastic reward $r_{n}\left(a_{1},\ldots,a_{N}\right)$. Agents cannot observe the actions of other agents and do not know even their own reward function. The agents can communicate with their neighbors on a connected graph $G$ with diameter $d\left(G\right)$. We want each agent $n$ to achieve an expected average reward of at least $\lambda_{n}$ over time, for a given quality of service (QoS) vector $\boldsymbol{\lambda}$. A QoS vector $\boldsymbol{\lambda}$ is not necessarily achievable.


Queue Up Your Regrets: Achieving the Dynamic Capacity Region of Multiplayer Bandits

Neural Information Processing Systems

Consider N cooperative agents such that for T turns, each agent n takes an action a_{n} and receives a stochastic reward r_{n}\left(a_{1},\ldots,a_{N}\right) . Agents cannot observe the actions of other agents and do not know even their own reward function. The agents can communicate with their neighbors on a connected graph G with diameter d\left(G\right) . We want each agent n to achieve an expected average reward of at least \lambda_{n} over time, for a given quality of service (QoS) vector \boldsymbol{\lambda} . By giving up on immediate reward, knowing that the other agents will compensate later, agents can improve their achievable capacity region.


Constant or logarithmic regret in asynchronous multiplayer bandits

arXiv.org Artificial Intelligence

Multiplayer bandits have recently been extensively studied because of their application to cognitive radio networks. While the literature mostly considers synchronous players, radio networks (e.g. for IoT) tend to have asynchronous devices. This motivates the harder, asynchronous multiplayer bandits problem, which was first tackled with an explore-then-commit (ETC) algorithm (see Dakdouk, 2022), with a regret upper-bound in $\mathcal{O}(T^{\frac{2}{3}})$. Before even considering decentralization, understanding the centralized case was still a challenge as it was unknown whether getting a regret smaller than $\Omega(T^{\frac{2}{3}})$ was possible. We answer positively this question, as a natural extension of UCB exhibits a $\mathcal{O}(\sqrt{T\log(T)})$ minimax regret. More importantly, we introduce Cautious Greedy, a centralized algorithm that yields constant instance-dependent regret if the optimal policy assigns at least one player on each arm (a situation that is proved to occur when arm means are close enough). Otherwise, its regret increases as the sum of $\log(T)$ over some sub-optimality gaps. We provide lower bounds showing that Cautious Greedy is optimal in the data-dependent terms. Therefore, we set up a strong baseline for asynchronous multiplayer bandits and suggest that learning the optimal policy in this problem might be easier than thought, at least with centralization.


A survey on multi-player bandits

arXiv.org Artificial Intelligence

Due mostly to its application to cognitive radio networks, multiplayer bandits gained a lot of interest in the last decade. A considerable progress has been made on its theoretical aspect. However, the current algorithms are far from applicable and many obstacles remain between these theoretical results and a possible implementation of multiplayer bandits algorithms in real cognitive radio networks. This survey contextualizes and organizes the rich multiplayer bandits literature. In light of the existing works, some clear directions for future research appear. We believe that a further study of these different directions might lead to theoretical algorithms adapted to real-world situations.


New Algorithms for Multiplayer Bandits when Arm Means Vary Among Players

arXiv.org Machine Learning

We study multiplayer stochastic multi-armed bandit problems in which the players cannot communicate,and if two or more players pull the same arm, a collision occurs and the involved players receive zero reward.Moreover, we assume each arm has a different mean for each player. Let $T$ denote the number of rounds.An algorithm with regret $O((\log T)^{2+\kappa})$ for any constant $\kappa$ was recently presented by Bistritz and Leshem (NeurIPS 2018), who left the existence of an algorithm with $O(\log T)$ regret as an open question. In this paper, we provide an affirmative answer to this question in the case when there is a unique optimal assignment of players to arms. For the general case we present an algorithm with expected regret $O((\log T)^{1+\kappa})$, for any $\kappa>0$.