Multi-Player Bandits: A Trekking Approach

Hanawal, Manjesh K., Darak, Sumit J.

arXiv.org Machine Learning 

Abstract--We study stochastic multi-armed bandits with many players. The players do not know the number of players, cannot communicate with each other and if multiple players select a common arm they collide and none of them receive any reward. We consider the static scenario, where the number of players remains fixed, and the dynamic scenario, where the players enter and leave at any time. We provide algorithms based on a novel'trekking approach' that guarantees constant regret for the static case and sub-linear regret for the dynamic case with high probability. The trekking approach eliminates the need to estimate the number of players resulting in fewer collisions and improved regret performance compared to the state-of-theart algorithms. We also develop an epoch-less algorithm that eliminates any requirement of time synchronization across the players provided each player can detect the presence of other players on an arm. We validate our theoretical guarantees using simulation based and real test-bed based experiments. Multi-player multi-armed bandits (MPMAB) is a variant of the stochastic multi-armed bandits [1]-[3] where multiple players aim to maximize sum of their rewards playing the same set of arms. In this setting, the players do not communicate with each other and may not know number of other players in the game. If two or more players select the same arm simultaneously, they experience'collision' and none of them receive any reward. Our goal in this work is to develop distributed algorithms that aim to achieve high total rewards while keeping the number of collisions as low as possible. The study of MPMAB is mainly motivated from the ad-hoc cognitive radio networks (CRN) where multiple users transmit on a common set of channels (unlicensed spectrum) without any communication among them [4], [5]. Due to the ad hoc nature of such networks, a central controller, or a common control channels for coordination, may not be available and all channel selection decisions have to be done in a decentralized fashion [4]-[6]. Such models are being envisioned for futuristic ultra-dense wireless communication networks that can offer very high peak rates [7]. The quality of the channels are unknown to the users and their goal is to maximize number of successful transmissions (or sum rate/ throughput) in the network.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found