The Power of Populations in Decentralized Bandits

Lazarsfeld, John, Alistarh, Dan

arXiv.org Artificial Intelligence 

The multi-armed bandit problem, where a single learning agent chooses actions over a sequence of rounds in order to maximize its total reward, is among the most well-studied in online learning. Distributed, multi-agent variants of this problem have also been widely studied under various constraints; one particular such line of work is the cooperative multi-agent bandit setting, where agents are connected over a communication graph and play against a common bandit instance, choosing actions in parallel over T rounds. Each agent locally runs a bandit algorithm that may involve communication with neighbors, and the information exchanged can be used to determine an agent's future actions. This cooperative setting has been studied for both stochastic (Szorenyi et al., 2013; Landgren et al., 2016; Kolla et al., 2018; Martínez-Rubio et al., 2019) and non-stochastic bandits (Awerbuch & Kleinberg, 2008; Cesa-Bianchi et al., 2016; Bar-On & Mansour, 2019), where communication between agents has been shown to improve an agent's regret on average, compared to each agent locally running a centralized bandit algorithm without any communication. However, most prior works in this setting require that every agent communicate with all its neighbors in each round (as pointed out by Cesa-Bianchi et al. (2016), this resembles the LOCAL model of distributed computation (Linial, 1992)). When the underlying graph is dense, this volume of communication may be prohibitively large, which is a known bottleneck in many practical distributed machine learning settings (Alistarh et al., 2017; Koloskova et al., 2019a;b). In contrast, much less is known about cooperative multi-agent bandits in more lightweight decentralized models of distributed communication, such as the GOSSIP model (Boyd et al., 2006; Shah