We study the decentralized multi-agent multi-armed bandit problem for agents that communicate with probability over a network defined by a $d$-regular graph. Every edge in the graph has probabilistic weight $p$ to account for the ($1\!-\!p$) probability of a communication link failure. At each time step, each agent chooses an arm and receives a numerical reward associated with the chosen arm. After each choice, each agent observes the last obtained reward of each of its neighbors with probability $p$. We propose a new Upper Confidence Bound (UCB) based algorithm and analyze how agent-based strategies contribute to minimizing group regret in this probabilistic communication setting. We provide theoretical guarantees that our algorithm outperforms state-of-the-art algorithms. We illustrate our results and validate the theoretical claims using numerical simulations.
We investigate the benefits of heterogeneity in multi-agent explore-exploit decision making where the goal of the agents is to maximize cumulative group reward. To do so we study a class of distributed stochastic bandit problems in which agents communicate over a multi-star network and make sequential choices among options in the same uncertain environment. Typically, in multi-agent bandit problems, agents use homogeneous decision-making strategies. However, group performance can be improved by incorporating heterogeneity into the choices agents make, especially when the network graph is irregular, i.e. when agents have different numbers of neighbors. We design and analyze new heterogeneous explore-exploit strategies, using the multi-star as the model irregular network graph. The key idea is to enable center agents to do more exploring than they would do using the homogeneous strategy, as a means of providing more useful data to the peripheral agents. In the case all agents broadcast their reward values and choices to their neighbors with the same probability, we provide theoretical guarantees that group performance improves under the proposed heterogeneous strategies as compared to under homogeneous strategies. We use numerical simulations to illustrate our results and to validate our theoretical bounds.
This paper considers a multi-armed bandit (MAB) problem in which multiple mobile agents receive rewards by sampling from a collection of spatially dispersed stochastic processes, called bandits. The goal is to formulate a decentralized policy for each agent, in order to maximize the total cumulative reward over all agents, subject to option availability and inter-agent communication constraints. The problem formulation is motivated by applications in which a team of autonomous mobile robots cooperates to accomplish an exploration and exploitation task in an uncertain environment. Bandit locations are represented by vertices of the spatial graph. At any time, an agent's option consist of sampling the bandit at its current location, or traveling along an edge of the spatial graph to a new bandit location. Communication constraints are described by a directed, non-stationary, stochastic communication graph. At any time, agents may receive data only from their communication graph in-neighbors. For the case of a single agent on a fully connected spatial graph, it is known that the expected regret for any optimal policy is necessarily bounded below by a function that grows as the logarithm of time. A class of policies called upper confidence bound (UCB) algorithms asymptotically achieve logarithmic regret for the classical MAB problem. In this paper, we propose a UCB-based decentralized motion and option selection policy and a non-stationary stochastic communication protocol that guarantee logarithmic regret. To our knowledge, this is the first such decentralized policy for non-fully connected spatial graphs with communication constraints. When the spatial graph is fully connected and the communication graph is stationary, our decentralized algorithm matches or exceeds the best reported prior results from the literature.
Existing studies of the Multi Agent Multi Armed Bandit (MAMAB) problem, with the exception of a very few, consider the case where the agents observe their neighbors according to a static network graph. They also mostly rely on a running consensus for the estimation of the option rewards. Two of the exceptions consider a problem where agents observe instantaneous rewards and actions of their neighbors through an iid ER graph process based communication strategy. In this paper we propose a UCB based option allocation rule that guarantees logarithmic regret even if the graph depends on the history of choices made by the agents. The paper also proposes a novel communication strategy that significantly outperforms the iid ER graph based communication strategy. In both the ER graph and the dependent graph strategy, the regret is shown to depend on the connectivity of the graph in a particularly interesting way where there exists an optimal connectivity of the graph that is less than the full connectivity of the graph.
We study a distributed decision-making problem in which multiple agents face the same multi-armed bandit (MAB), and each agent makes sequential choices among arms to maximize its own individual reward. The agents cooperate by sharing their estimates over a fixed communication graph. We consider an unconstrained reward model in which two or more agents can choose the same arm and collect independent rewards. And we consider a constrained reward model in which agents that choose the same arm at the same time receive no reward. We design a dynamic, consensus-based, distributed estimation algorithm for cooperative estimation of mean rewards at each arm. We leverage the estimates from this algorithm to develop two distributed algorithms: coop-UCB2 and coop-UCB2-selective-learning, for the unconstrained and constrained reward models, respectively. We show that both algorithms achieve group performance close to the performance of a centralized fusion center. Further, we investigate the influence of the communication graph structure on performance. We propose a novel graph explore-exploit index that predicts the relative performance of groups in terms of the communication graph, and we propose a novel nodal explore-exploit centrality index that predicts the relative performance of agents in terms of the agent locations in the communication graph.