Recent developments in deep reinforcement learning are concerned with creating decision-making agents which can perform well in various complex domains. A particular approach which has received increasing attention is multi-agent reinforcement learning, in which multiple agents learn concurrently to coordinate their actions. In such multi-agent environments, additional learning problems arise due to the continually changing decision-making policies of agents. This paper surveys recent works that address the non-stationarity problem in multi-agent deep reinforcement learning. The surveyed methods range from modifications in the training procedure, such as centralized training, to learning representations of the opponent's policy, meta-learning, communication, and decentralized learning. The survey concludes with a list of open problems and possible lines of future research.
The University of Toronto and the affiliated Vector Institute for Artificial Intelligence have announced the recruitment of two rising stars in machine learning research as part of a continued drive to assemble the best AI talent in the world. Chris Maddison and Jakob Foerster will both come to U of T having completed their doctoral research at the University of Oxford. He earned his undergraduate and master's degrees in computer science at U of T – the latter under the supervision of University Professor Emeritus Geoffrey Hinton. A senior research scientist at Google-owned AI firm DeepMind, Maddison will join U of T's departments of computer science and statistical sciences in the Faculty of Arts & Science as an assistant professor next summer. Foerster, a research scientist at Facebook AI Research, will start as an assistant professor in the department of computer and mathematical sciences at U of T Scarborough in fall of 2020.
Many real-world problems, such as network packet routing and urban traffic control, are naturally modeled as multi-agent reinforcement learning (RL) problems. However, existing multi-agent RL methods typically scale poorly in the problem size. Therefore, a key challenge is to translate the success of deep learning on single-agent RL to the multi-agent setting. A major stumbling block is that independent Q-learning, the most popular multi-agent RL method, introduces nonstationarity that makes it incompatible with the experience replay memory on which deep Q-learning relies. This paper proposes two methods that address this problem: 1) using a multi-agent variant of importance sampling to naturally decay obsolete data and 2) conditioning each agent's value function on a fingerprint that disambiguates the age of the data sampled from the replay memory. Results on a challenging decentralised variant of StarCraft unit micromanagement confirm that these methods enable the successful combination of experience replay with multi-agent RL.
In this paper, we propose actor-critic approaches by introducing an actor policy on QMIX , which can remove the monotonicity constraint of QMIX and implement a non-monotonic value function factorization for joint action-value. We evaluate our actor-critic methods on StarCraft II micromanagement tasks, and show that it has a stronger performance on maps with heterogeneous agent types.
In many real-world settings, a team of agents must coordinate their behaviour while acting in a decentralised way. At the same time, it is often possible to train the agents in a centralised fashion in a simulated or laboratory setting, where global state information is available and communication constraints are lifted. Learning joint action-values conditioned on extra state information is an attractive way to exploit centralised learning, but the best strategy for then extracting decentralised policies is unclear. Our solution is QMIX, a novel value-based method that can train decentralised policies in a centralised end-to-end fashion. QMIX employs a network that estimates joint action-values as a complex non-linear combination of per-agent values that condition only on local observations. We structurally enforce that the joint-action value is monotonic in the per-agent values, which allows tractable maximisation of the joint action-value in off-policy learning, and guarantees consistency between the centralised and decentralised policies. We evaluate QMIX on a challenging set of StarCraft II micromanagement tasks, and show that QMIX significantly outperforms existing value-based multi-agent reinforcement learning methods.