Collaborating Authors

Munoz de Cote, Enrique


Decentralised Learning in Systems With Many, Many Strategic Agents

AAAI Conferences

Although multi-agent reinforcement learning can tackle systems of strategically interacting entities, it currently scales poorly and lacks rigorous convergence guarantees. Crucially, learning in multi-agent systems can become intractable due to the explosion in the size of the state-action space as the number of agents increases. In this paper, we propose a method for computing closed-loop optimal policies in multi-agent systems that scales independently of the number of agents. This allows us to show, for the first time, successful convergence to optimal behaviour in systems with an unbounded number of interacting adaptive learners. Studying the asymptotic regime of N-player stochastic games, we devise a learning protocol that is guaranteed to converge to equilibrium policies even when the number of agents is extremely large. Our method is model-free and completely decentralised, so each agent need only observe its local state information and its realised rewards. We validate these theoretical results by showing convergence to Nash-equilibrium policies in applications from economics and control theory with thousands of strategically interacting agents.
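
For intuition, here is a minimal sketch of the decentralised, model-free setting the abstract describes: each agent runs its own learner using only its local state and realised reward. This is a plain independent Q-learner, an illustrative stand-in rather than the paper's exact protocol, and all names and hyperparameters below are assumptions.

```python
import random
from collections import defaultdict

class DecentralisedQLearner:
    """Independent learner: observes only its own local state and realised reward."""

    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.q = defaultdict(float)   # (local_state, action) -> value estimate
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, state):
        # epsilon-greedy over the agent's own value estimates
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # standard Q-learning update on purely local information
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])

# Each of the N agents runs its own learner; no agent observes another's
# state, action, or reward, which is what makes the scheme decentralised.
agents = [DecentralisedQLearner(actions=[0, 1]) for _ in range(10_000)]
```

Note that per-agent memory here depends on the local state-action space only, not on N, which is the property that lets the number of agents grow without blowing up each learner.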


Identifying and Tracking Switching, Non-Stationary Opponents: A Bayesian Approach

AAAI Conferences

In many situations, agents are required to use a set of strategies (behaviors) and switch among them during the course of an interaction. This work focuses on the problem of recognizing the strategy used by an agent within a small number of interactions. We propose a Bayesian framework to address this problem. Bayesian policy reuse (BPR) has been empirically shown to be efficient at correctly detecting the best policy to use from a library in sequential decision tasks. In this paper, we extend BPR to adversarial settings, in particular, to opponents that switch from one stationary strategy to another. Our proposed extension enables learning new models in an online fashion when the learning agent detects that the current policies are not performing optimally. Experiments in repeated games show that our approach efficiently detects opponent strategies and reacts quickly to behavior switches, thereby yielding better performance, in terms of average rewards, than state-of-the-art approaches.
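
To make the BPR-style detection loop concrete, below is a minimal sketch of the Bayesian belief update over opponent models and the policy-selection step. The observation-model tables, signal values, and model names are illustrative assumptions, not the paper's formulation.

```python
def update_belief(belief, likelihood, signal):
    """Posterior over opponent models given one observed signal.
    belief: {model: prior prob}; likelihood: {model: {signal: P(signal | model)}}."""
    posterior = {m: belief[m] * likelihood[m].get(signal, 1e-9) for m in belief}
    z = sum(posterior.values())
    return {m: p / z for m, p in posterior.items()}

def select_policy(belief, utility):
    """Pick the policy with the highest expected utility under the current belief.
    utility: {(policy, model): expected reward of policy against model}."""
    policies = {p for p, _ in utility}
    return max(policies,
               key=lambda p: sum(belief[m] * utility[(p, m)] for m in belief))

# Hypothetical example: two known opponent models, reward signals per round.
belief = {"tit_for_tat": 0.5, "always_defect": 0.5}
likelihood = {"tit_for_tat":   {"high": 0.8, "low": 0.2},
              "always_defect": {"high": 0.1, "low": 0.9}}
for signal in ["low", "low", "high"]:
    belief = update_belief(belief, likelihood, signal)
```

A switch by the opponent shows up as the posterior mass moving to a different model within a few rounds; the paper's extension additionally spawns a new model online when no existing model explains the observed signals well.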


A Distributed Algorithm for Optimising over Pure Strategy Nash Equilibria

AAAI Conferences

We develop an efficient algorithm for computing pure strategy Nash equilibria that satisfy various criteria (such as the utilitarian or Nash-Bernoulli social welfare functions) in games with sparse interaction structure. Our algorithm, called Valued Nash Propagation (VNP), integrates the optimisation problem of maximising a criterion with the constraint satisfaction problem of finding a game's equilibria to construct a criterion that defines a c-semiring. Given a suitably compact game structure, this criterion can be efficiently optimised using message-passing. We first show that VNP is complete in games whose interaction structure forms a hypertree. We then provide theoretical and empirical results justifying its use on games with arbitrary structure; in particular, we show that it computes the optimum more than 82% of the time and otherwise selects an equilibrium that is, on average, within 2% of the optimum.
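
As a reference point for the objective VNP optimises, here is a brute-force sketch that enumerates pure-strategy Nash equilibria and picks the one maximising utilitarian welfare. This is the naive baseline the message-passing algorithm is designed to avoid, not VNP itself, and the payoff encoding is an assumption for illustration.

```python
import itertools

def pure_nash_equilibria(payoffs, n_actions):
    """Enumerate joint action profiles that are pure-strategy Nash equilibria.
    payoffs[i][joint] is player i's utility for the joint profile `joint`."""
    n_players = len(payoffs)
    equilibria = []
    for joint in itertools.product(*(range(n) for n in n_actions)):
        stable = True
        for i in range(n_players):
            # a profile is an equilibrium iff no player has a profitable deviation
            for dev in range(n_actions[i]):
                alt = joint[:i] + (dev,) + joint[i + 1:]
                if payoffs[i][alt] > payoffs[i][joint]:
                    stable = False
                    break
            if not stable:
                break
        if stable:
            equilibria.append(joint)
    return equilibria

def best_equilibrium(payoffs, n_actions):
    # utilitarian criterion: maximise the sum of utilities over all players
    eqs = pure_nash_equilibria(payoffs, n_actions)
    return max(eqs, key=lambda j: sum(p[j] for p in payoffs), default=None)

# Prisoner's dilemma (cooperate=0, defect=1); the unique equilibrium is (1, 1).
pd = [{(0, 0): -1, (0, 1): -3, (1, 0): 0, (1, 1): -2},
      {(0, 0): -1, (0, 1): 0, (1, 0): -3, (1, 1): -2}]
print(best_equilibrium(pd, n_actions=[2, 2]))  # -> (1, 1)
```

Enumeration is exponential in the number of players; VNP's contribution is exploiting sparse interaction structure so that the same optimisation can be done by passing messages over the game's hypergraph instead.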


Epsilon-First Policies for Budget-Limited Multi-Armed Bandits

AAAI Conferences

We introduce the budget-limited multi-armed bandit (MAB), which captures situations where a learner's actions are costly and constrained by a fixed budget that is incommensurable with the rewards earned from the bandit machine, and then describe a first algorithm for solving it. Since the learner has a budget, the problem's duration is finite. Consequently, an optimal exploitation policy is not to pull the optimal arm repeatedly, but to pull the combination of arms that maximises the agent's total reward within the budget. As such, the rewards for all arms must be estimated, because any of them may appear in the optimal combination. This difference from existing MABs means that new approaches to maximising the total reward are required. To this end, we propose an epsilon-first algorithm, in which the first epsilon fraction of the budget is used solely to learn the arms' rewards (exploration), while the remaining (1 − epsilon) fraction is used to maximise the received reward based on those estimates (exploitation). We derive bounds on the algorithm's loss for generic and uniform exploration methods, and compare its performance with traditional MAB algorithms under various distributions of rewards and costs, showing that it outperforms the others by up to 50%.
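
A minimal sketch of the epsilon-first split follows: a uniform sweep over the arms consumes the first epsilon fraction of the budget, then the remainder is spent exploiting the estimates. Here the exploitation step greedily pulls the arm with the best estimated reward-per-cost ratio, a simple stand-in for the combination (knapsack-style) exploitation the abstract describes; positive costs and the function names are assumptions.

```python
import random

def epsilon_first(arms, budget, epsilon=0.1):
    """arms: list of (pull_fn, cost) with cost > 0; pull_fn() returns a reward.
    Spend epsilon * budget on uniform exploration, the rest on exploitation."""
    estimates = [0.0] * len(arms)
    counts = [0] * len(arms)
    explore_budget = epsilon * budget
    total = 0.0

    # exploration: sweep the arms round-robin until the exploration budget runs out
    i = 0
    while explore_budget >= arms[i % len(arms)][1]:
        k = i % len(arms)
        pull, cost = arms[k]
        r = pull()
        counts[k] += 1
        estimates[k] += (r - estimates[k]) / counts[k]   # running mean reward
        explore_budget -= cost
        budget -= cost
        total += r
        i += 1

    # exploitation: greedily pull the affordable arm with the best reward density
    while True:
        affordable = [k for k, (_, c) in enumerate(arms) if c <= budget]
        if not affordable:
            break
        k = max(affordable, key=lambda k: estimates[k] / arms[k][1])
        pull, cost = arms[k]
        total += pull()
        budget -= cost
    return total

# Hypothetical usage: three Gaussian arms with different mean rewards and costs.
random.seed(0)
arms = [(lambda m=m: random.gauss(m, 0.1), c)
        for m, c in [(1.0, 2.0), (0.6, 1.0), (0.2, 0.5)]]
print(epsilon_first(arms, budget=100.0, epsilon=0.15))
```

The key departure from a standard MAB is visible in the exploitation loop: because costs differ, the best single arm by mean reward is not necessarily pulled, and the remaining budget determines which arms are still worth playing.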