Pommerman


Multi-Agent Training for Pommerman: Curriculum Learning and Population-based Self-Play Approach

Huynh, Nhat-Minh, Cao, Hoang-Giang, Wu, I-Chen

arXiv.org Artificial Intelligence

Pommerman is a multi-agent environment that has received considerable attention from researchers in recent years. This environment is an ideal benchmark for multi-agent training, providing a battleground for two teams with communication capabilities among allied agents. Pommerman presents significant challenges for model-free reinforcement learning due to delayed action effects, sparse rewards, and false positives, where opponent players can lose due to their own mistakes. This study introduces a system designed to train multi-agent systems to play Pommerman using a combination of curriculum learning and population-based self-play. We also tackle two challenging problems that arise when deploying a multi-agent training system for competitive games: sparse rewards and a suitable matchmaking mechanism. Specifically, we propose an adaptive annealing factor, based on agents' performance, that dynamically adjusts the dense exploration reward during training. Additionally, we implement a matchmaking mechanism utilizing the Elo rating system to pair agents effectively. Our experimental results demonstrate that our trained agent can outperform top learning agents without requiring communication among allied agents.
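The Elo-based matchmaking step can be illustrated with a minimal sketch (the function names, the K-factor of 32, and the nearest-rating pairing rule below are illustrative assumptions, not the paper's code): expected scores follow the standard logistic Elo curve, both ratings are updated after each match, and an agent is paired with the pool member whose rating is closest to its own.

```python
def elo_expected(r_a, r_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32):
    """Update both ratings after a match; score_a is 1 (win), 0.5 (draw), or 0 (loss)."""
    e_a = elo_expected(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

def closest_opponent(rating, pool):
    """Pair an agent with the pool member whose rating is closest to its own."""
    return min(pool, key=lambda r: abs(r - rating))
```

With equal ratings, a win moves the winner up by K/2 and the loser down by the same amount, so a population of self-play agents stays zero-sum in rating.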


Know your Enemy: Investigating Monte-Carlo Tree Search with Opponent Models in Pommerman

Weil, Jannis, Czech, Johannes, Meuser, Tobias, Kersting, Kristian

arXiv.org Artificial Intelligence

In combination with Reinforcement Learning, Monte-Carlo Tree Search has been shown to outperform human grandmasters in games such as Chess, Shogi and Go with little to no prior domain knowledge. However, most classical use cases only feature up to two players. Scaling the search to an arbitrary number of players presents a computational challenge, especially if decisions have to be planned over a longer time horizon. In this work, we investigate techniques that transform general-sum multiplayer games into single-player and two-player games in which the other agents are assumed to act according to given opponent models. For our evaluation, we focus on the challenging Pommerman environment, which involves partial observability, a long time horizon and sparse rewards. In combination with our search methods, we investigate the phenomenon of opponent modeling using heuristics and self-play. Overall, we demonstrate the effectiveness of our multiplayer search variants in both supervised learning and reinforcement learning settings.
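The core reduction can be sketched generically (a simplified illustration under assumed interfaces, not the paper's implementation): given a multi-agent step function and a fixed opponent model per other agent, the environment collapses into a single-agent one that any standard search can plan over.

```python
def to_single_player(multi_step, opponent_policies, hero):
    """Reduce a multi-agent step function to a single-agent one by letting
    every non-hero agent act according to its (fixed) opponent model.

    multi_step(state, joint_actions) -> next_state, where joint_actions
    maps agent id -> action; opponent_policies maps agent id -> policy(state).
    """
    def step(state, hero_action):
        joint = {hero: hero_action}
        for agent, policy in opponent_policies.items():
            joint[agent] = policy(state)  # opponents act per their model
        return multi_step(state, joint)
    return step
```

The two-player variant is analogous: keep one opponent's action free and fix the rest with models, which is what allows classical minimax-style or two-player MCTS machinery to apply.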


Learning from Multiple Independent Advisors in Multi-agent Reinforcement Learning

Subramanian, Sriram Ganapathi, Taylor, Matthew E., Larson, Kate, Crowley, Mark

arXiv.org Artificial Intelligence

Multi-agent reinforcement learning typically suffers from the problem of sample inefficiency, where learning suitable policies involves the use of many data samples. Learning from external demonstrators is a possible solution that mitigates this problem. However, most prior approaches in this area assume the presence of a single demonstrator. Leveraging multiple knowledge sources (i.e., advisors) with expertise in distinct aspects of the environment could substantially speed up learning in complex environments. This paper considers the problem of simultaneously learning from multiple independent advisors in multi-agent reinforcement learning. The approach leverages a two-level Q-learning architecture, and extends this framework from single-agent to multi-agent settings. We provide principled algorithms that incorporate a set of advisors by both evaluating the advisors at each state and subsequently using the advisors to guide action selection. We also provide theoretical convergence and sample complexity guarantees. Experimentally, we validate our approach in three different test-beds and show that our algorithms outperform baselines, can effectively integrate the combined expertise of different advisors, and learn to ignore bad advice.
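The two-level idea can be sketched in tabular form (an illustrative single-agent simplification under assumed interfaces, not the paper's exact algorithm): an outer Q-function scores each knowledge source (an advisor, or the agent itself) per state, and an inner Q-function over primitive actions is learned as usual, so sources whose advice leads to low returns are gradually ignored.

```python
import random
from collections import defaultdict

class TwoLevelAdvisorQ:
    """Sketch of two-level advisor-guided Q-learning: the outer level picks
    which source to follow at each state, the inner level learns action values."""

    def __init__(self, actions, advisors, alpha=0.1, gamma=0.99, eps=0.1):
        self.actions, self.advisors = actions, advisors
        self.alpha, self.gamma, self.eps = alpha, gamma, eps
        self.q = defaultdict(float)      # inner: (state, action) -> value
        self.q_src = defaultdict(float)  # outer: (state, source) -> value
        self.sources = list(range(len(advisors))) + ["self"]

    def act(self, state):
        """Return (action, source); the source is needed for the update."""
        if random.random() < self.eps:
            return random.choice(self.actions), "self"
        src = max(self.sources, key=lambda s: self.q_src[(state, s)])
        if src == "self":
            return max(self.actions, key=lambda a: self.q[(state, a)]), src
        return self.advisors[src](state), src

    def update(self, s, a, r, s2, src):
        """One Q-learning backup applied to both levels from the same transition."""
        target = r + self.gamma * max(self.q[(s2, b)] for b in self.actions)
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])
        self.q_src[(s, src)] += self.alpha * (target - self.q_src[(s, src)])
```

Because bad advisors accumulate low outer-level values, the argmax over sources eventually routes around them, which is the "learn to ignore bad advice" behavior.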


A Fast Evolutionary adaptation for MCTS in Pommerman

Panwar, Harsh, Chatterjee, Saswata, Dube, Wil

arXiv.org Artificial Intelligence

Artificial intelligence, when combined with games, provides an ideal setting for research and for advancing the field. Multi-agent games give each agent multiple controls, which generates huge amounts of data while increasing search complexity. Thus, we need advanced search methods to find a solution and create an artificially intelligent agent. In this paper, we propose our novel Fast Evolutionary Monte Carlo Tree Search (FEMCTS) agent, which borrows ideas from Evolutionary Algorithms (EA) and Monte Carlo Tree Search (MCTS) to play the game of Pommerman. It significantly outperforms the Rolling Horizon Evolutionary Algorithm (RHEA) in high-observability settings and performs almost as well as MCTS for most game seeds, outperforming it in some cases.


School of hard knocks: Curriculum analysis for Pommerman with a fixed computational budget

Shelke, Omkar, Meisheri, Hardik, Khadilkar, Harshad

arXiv.org Artificial Intelligence

Pommerman is a hybrid cooperative/adversarial multi-agent environment, with challenging characteristics in terms of partial observability, limited or no communication, sparse and delayed rewards, and restrictive computational time limits. This makes it a challenging environment for reinforcement learning (RL) approaches. In this paper, we focus on developing a curriculum for learning a robust and promising policy in a constrained computational budget of 100,000 games, starting from a fixed base policy (which is itself trained to imitate a noisy expert policy). All RL algorithms starting from the base policy use vanilla proximal-policy optimization (PPO) with the same reward function, and the only difference between their training is the mix and sequence of opponent policies. One expects that beginning training with simpler opponents and then gradually increasing the opponent difficulty will facilitate faster learning, leading to more robust policies than a baseline where all available opponent policies are introduced from the start. We test this hypothesis and show that, within constrained computational budgets, it is in fact better to "learn in the school of hard knocks", i.e., against all available opponent policies nearly from the start. We also include ablation studies where we study the effect of modifying the base environment properties of ammo and bomb blast strength on the agent performance.
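The two opponent-selection schemes being compared can be sketched as follows (an illustrative reading of the setup, with invented stage pools and a linear schedule as assumptions): a staged curriculum advances through opponent pools as the game budget is consumed, while the "hard knocks" baseline samples from the union of all pools from game one.

```python
import random

def staged_opponent(game_idx, stages, budget):
    """Staged curriculum: advance through opponent pools as training progresses.

    stages is a list of opponent pools ordered easy -> hard; the active pool
    is chosen by the fraction of the game budget already used.
    """
    frac = game_idx / budget
    idx = min(int(frac * len(stages)), len(stages) - 1)
    return random.choice(stages[idx])

def hard_knocks_opponent(stages):
    """'School of hard knocks': sample from all pools from the very start."""
    return random.choice([opp for pool in stages for opp in pool])
```

The paper's finding is that, under a fixed budget, the second scheme yields the more robust final policy.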


Learning to Play Imperfect-Information Games by Imitating an Oracle Planner

Boney, Rinu, Ilin, Alexander, Kannala, Juho, Seppänen, Jarno

arXiv.org Artificial Intelligence

We consider learning to play multiplayer imperfect-information games with simultaneous moves and large state-action spaces. Previous attempts to tackle such challenging games have largely focused on model-free learning methods, often requiring hundreds of years of experience to produce competitive agents. Our approach is based on model-based planning. We tackle the problem of partial observability by first building an (oracle) planner that has access to the full state of the environment and then distilling the knowledge of the oracle to a (follower) agent which is trained to play the imperfect-information game by imitating the oracle's choices. We experimentally show that planning with naive Monte Carlo tree search does not perform very well in large combinatorial action spaces. We therefore propose planning with a fixed-depth tree search and decoupled Thompson sampling for action selection. We show that the planner is able to discover efficient playing strategies in the games of Clash Royale and Pommerman and the follower policy successfully learns to implement them by training on a few hundred battles.
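The decoupled Thompson sampling step for action selection can be sketched with independent Beta posteriors per action (a minimal illustration, not the paper's planner; the class name and the use of rollout values as Bernoulli-style rewards are assumptions): sample once from each action's posterior and pick the argmax, which explores uncertain actions without an explicit exploration bonus.

```python
import random

class ThompsonActionSelector:
    """Decoupled Thompson sampling over discrete actions: an independent
    Beta posterior per action, updated from fixed-depth search values."""

    def __init__(self, n_actions):
        self.wins = [1.0] * n_actions    # Beta prior alpha = 1
        self.losses = [1.0] * n_actions  # Beta prior beta = 1

    def select(self):
        """Draw one sample per action's posterior and act greedily on them."""
        samples = [random.betavariate(w, l)
                   for w, l in zip(self.wins, self.losses)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, action, reward):
        """reward in [0, 1], e.g. a normalized value from a fixed-depth rollout."""
        self.wins[action] += reward
        self.losses[action] += 1.0 - reward
```

"Decoupled" here refers to treating each action's posterior independently, which scales to the large combinatorial action spaces where naive MCTS struggled.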


Accelerating Training in Pommerman with Imitation and Reinforcement Learning

Meisheri, Hardik, Shelke, Omkar, Verma, Richa, Khadilkar, Harshad

arXiv.org Machine Learning

The Pommerman simulation was recently developed to mimic the classic Japanese game Bomberman, and focuses on competitive gameplay in a multi-agent setting. We focus on the 2×2 team version of Pommerman, developed for a competition at NeurIPS 2018. Our methodology involves training an agent initially through imitation learning on a noisy expert policy, followed by a proximal-policy optimization (PPO) reinforcement learning algorithm. The basic PPO approach is modified for a stable transition from the imitation learning phase through reward shaping, action filters based on heuristics, and curriculum learning. The proposed methodology is able to beat heuristic and pure reinforcement learning baselines with a combined 100,000 training games, significantly faster than other non-tree-search methods in the literature. We present results against multiple agents provided by the developers of the simulation, including some that we have enhanced. We include a sensitivity analysis over different parameters, and highlight undesirable effects of some strategies that initially appear promising. Since Pommerman is a complex multi-agent competitive environment, the strategies developed here provide insights into several real-world problems with characteristics such as partial observability, decentralized execution (without communication), and very sparse and delayed rewards.
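A heuristic action filter of the kind described can be sketched as a mask over the policy's greedy choice (an illustrative pattern with an invented example filter, not the paper's rules): each filter vetoes actions it deems unsafe in the current state, and the agent acts greedily over the survivors.

```python
def filtered_argmax(q_values, state, filters):
    """Mask out actions that any heuristic filter vetoes in this state,
    then act greedily over the survivors (falling back to the unfiltered
    action set if every action is vetoed)."""
    allowed = [a for a in range(len(q_values))
               if all(ok(state, a) for ok in filters)]
    pool = allowed or range(len(q_values))
    return max(pool, key=lambda a: q_values[a])
```

A typical Pommerman-style filter might forbid placing a bomb with zero ammo; the fallback keeps the agent from deadlocking when filters disagree with every action.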


On Hard Exploration for Reinforcement Learning: a Case Study in Pommerman

Gao, Chao, Kartal, Bilal, Hernandez-Leal, Pablo, Taylor, Matthew E.

arXiv.org Artificial Intelligence

How to best explore in domains with sparse, delayed, and deceptive rewards is an important open problem for reinforcement learning (RL). This paper considers one such domain, the recently-proposed multi-agent benchmark of Pommerman. This domain is very challenging for RL: past work has shown that model-free RL algorithms fail to achieve significant learning without artificially reducing the environment's complexity. In this paper, we illuminate reasons behind this failure by providing a thorough analysis of the hardness of random exploration in Pommerman. While model-free random exploration is typically futile, we develop a model-based automatic reasoning module that can be used for safer exploration by pruning actions that will surely lead the agent to death. We empirically demonstrate that this module can significantly improve learning.
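The safety-pruning idea can be sketched as a one-step lookahead over move destinations (a deliberately simplified illustration; the paper's module reasons over bomb timers and multi-step blasts, whereas this sketch only checks cells that are walls or explode next tick):

```python
def prune_fatal_moves(position, moves, blast_cells, walls):
    """Keep only moves whose destination is walkable and not inside a blast
    that resolves on the next tick (a minimal one-step safety check)."""
    safe = []
    for move, (dx, dy) in moves.items():
        dest = (position[0] + dx, position[1] + dy)
        if dest not in walls and dest not in blast_cells:
            safe.append(move)
    return safe
```

Restricting random exploration to the pruned set is exactly what turns "typically futile" exploration into something the learner can benefit from: the agent stops terminating episodes by stepping into certain death.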


Action Guidance with MCTS for Deep Reinforcement Learning

Kartal, Bilal, Hernandez-Leal, Pablo, Taylor, Matthew E.

arXiv.org Machine Learning

Deep reinforcement learning has achieved great successes in recent years; however, one main challenge is sample inefficiency. In this paper, we focus on how to use action guidance, by means of a non-expert demonstrator, to improve sample efficiency in a domain with sparse, delayed, and possibly deceptive rewards: the recently-proposed multi-agent benchmark of Pommerman. We propose a new framework in which even a non-expert simulated demonstrator, e.g., a planning algorithm such as Monte Carlo tree search with a small number of rollouts, can be integrated within asynchronous distributed deep reinforcement learning methods. Compared to a vanilla deep RL algorithm, our proposed methods both learn faster and converge to better policies on a two-player mini version of the Pommerman game.

Introduction: Deep reinforcement learning (DRL) has enabled better scalability and generalization for challenging domains (Arulkumaran et al. 2017; Li 2017; Hernandez-Leal, Kartal, and Taylor 2018) such as Atari games (Mnih et al. 2015), Go (Silver et al. 2016) and multiagent games (e.g., Starcraft II and Dota 2) (OpenAI 2018). However, one of the current biggest challenges for DRL is sample efficiency (Yu 2018). On the one hand, once a DRL agent is trained, it can be deployed to act in real time by only performing an inference through the trained model. On the other hand, planning methods such as Monte Carlo tree search (MCTS) (Browne et al. 2012) do not have a training phase, but they perform computationally costly simulation-based rollouts (assuming access to a simulator) to find the best action to take. There are several ways to get the best of both DRL and search methods.
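One common way to integrate a demonstrator into an actor-critic loss is an auxiliary imitation term, sketched below (an illustrative formulation with an assumed weight, not necessarily the paper's exact loss): cross-entropy between the current policy and the demonstrator's chosen action is added to the actor loss.

```python
import math

def guided_loss(policy_probs, actor_loss, demo_action, weight=0.5):
    """Add an auxiliary imitation term to the actor loss: cross-entropy
    between the current policy and the (possibly non-expert) demonstrator's
    chosen action, scaled by `weight`."""
    imitation = -math.log(policy_probs[demo_action] + 1e-8)
    return actor_loss + weight * imitation
```

Because the demonstrator (e.g., small-rollout MCTS) runs asynchronously in some workers, this term nudges the policy toward planner-approved actions without requiring the demonstrator to be an expert.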


Terminal Prediction as an Auxiliary Task for Deep Reinforcement Learning

Kartal, Bilal, Hernandez-Leal, Pablo, Taylor, Matthew E.

arXiv.org Machine Learning

Deep reinforcement learning has achieved great successes in recent years, but there are still open challenges, such as convergence to locally optimal policies and sample inefficiency. In this paper, we contribute a novel self-supervised auxiliary task, Terminal Prediction (TP), which estimates temporal closeness to terminal states for episodic tasks. The intuition is to help representation learning by letting the agent predict how close it is to a terminal state while learning its control policy. Although TP could be integrated with multiple algorithms, this paper focuses on Asynchronous Advantage Actor-Critic (A3C) and demonstrates the advantages of A3C-TP. Our extensive evaluation includes: a set of Atari games, the BipedalWalker domain, and a mini version of the recently proposed multi-agent Pommerman game. Our results on Atari games and the BipedalWalker domain suggest that A3C-TP outperforms standard A3C in most of the tested domains, and performs similarly in the others. In Pommerman, our proposed method provides significant improvement both in learning efficiency and in converging to better policies against different opponents.
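The TP target itself is simple enough to state directly (a sketch of the natural reading, with squared error as an assumed choice of auxiliary loss): the label at step t of an episode of length T is the normalized closeness t/T, regressed alongside the policy and value heads.

```python
def terminal_prediction_loss(predicted, step, episode_length):
    """Auxiliary TP loss: squared error against the normalized temporal
    closeness to the terminal state, y_t = t / T."""
    target = step / episode_length
    return (predicted - target) ** 2
```

Since T is only known once the episode ends, the targets are computed retrospectively over the stored trajectory, which is what makes the task self-supervised: no extra environment signal is needed.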