Goto

Collaborating Authors

 Agents


Human-Level Performance in No-Press Diplomacy via Equilibrium Search

arXiv.org Artificial Intelligence

Prior AI breakthroughs in complex games have focused on either the purely adversarial or purely cooperative settings. In contrast, Diplomacy is a game of shifting alliances that involves both cooperation and competition. For this reason, Diplomacy has proven to be a formidable research challenge. In this paper we describe an agent for the no-press variant of Diplomacy that combines supervised learning on human data with one-step lookahead search via external regret minimization. External regret minimization techniques have been behind previous AI successes in adversarial games, most notably poker, but have not previously been shown to be successful in large-scale games involving cooperation. We show that our agent greatly exceeds the performance of past no-press Diplomacy bots, is unexploitable by expert humans, and achieves a rank of 23 out of 1,128 human players when playing anonymous games on a popular Diplomacy website. A primary goal for AI research is to develop agents that can act optimally in real-world multi-agent interactions (i.e., games). However, previous large-scale game AI results have focused on either purely competitive or purely cooperative settings. In contrast, real-world games, such as business negotiations, politics, and traffic navigation, involve a far more complex mixture of cooperation and competition. In such settings, the theoretical grounding for the techniques used in previous AI breakthroughs falls apart. In this paper we augment neural policies trained through imitation learning with regret minimization search techniques, and evaluate on the benchmark game of no-press Diplomacy.


Randomized Value Functions via Posterior State-Abstraction Sampling

arXiv.org Artificial Intelligence

State abstraction has been an essential tool for dramatically improving the sample efficiency of reinforcement-learning algorithms. Indeed, by exposing and accentuating various types of latent structure within the environment, different classes of state abstraction have enabled improved theoretical guarantees and empirical performance. When dealing with state abstractions that capture structure in the value function, however, a standard assumption is that the true abstraction has been supplied or unrealistically computed a priori, leaving open the question of how to efficiently uncover such latent structure while jointly seeking out optimal behavior. Taking inspiration from the bandit literature, we propose that an agent seeking out latent task structure must explicitly represent and maintain its uncertainty over that structure as part of its overall uncertainty about the environment. We introduce a practical algorithm for doing this using two posterior distributions over state abstractions and abstract-state values. In empirically validating our approach, we find that substantial performance gains lie in the multi-task setting where tasks share a common, low-dimensional representation.


Motion-Encoded Particle Swarm Optimization for Moving Target Search Using UAVs

arXiv.org Artificial Intelligence

This paper presents a novel algorithm named the motion-encoded particle swarm optimization (MPSO) for finding a moving target with unmanned aerial vehicles (UAVs). From the Bayesian theory, the search problem can be converted to the optimization of a cost function that represents the probability of detecting the target. Here, the proposed MPSO is developed to solve that problem by encoding the search trajectory as a series of UAV motion paths evolving over the generation of particles in a PSO algorithm. This motion-encoded approach allows for preserving important properties of the swarm including the cognitive and social coherence, and thus resulting in better solutions. Results from extensive simulations with existing methods show that the proposed MPSO improves the detection performance by 24\% and time performance by 4.71 times compared to the original PSO, and moreover, also outperforms other state-of-the-art metaheuristic optimization algorithms including the artificial bee colony (ABC), ant colony optimization (ACO), genetic algorithm (GA), differential evolution (DE), and tree-seed algorithm (TSA) in most search scenarios. Experiments have been conducted with real UAVs in searching for a dynamic target in different scenarios to demonstrate MPSO merits in a practical application.


RODE: Learning Roles to Decompose Multi-Agent Tasks

arXiv.org Machine Learning

Role-based learning holds the promise of achieving scalable multi-agent learning by decomposing complex tasks using roles. However, it is largely unclear how to efficiently discover such a set of roles. To solve this problem, we propose to first decompose joint action spaces into restricted role action spaces by clustering actions according to their effects on the environment and other agents. Learning a role selector based on action effects makes role discovery much easier because it forms a bi-level learning hierarchy -- the role selector searches in a smaller role space and at a lower temporal resolution, while role policies learn in significantly reduced primitive action-observation spaces. We further integrate information about action effects into the role policies to boost learning efficiency and policy generalization. By virtue of these advances, our method (1) outperforms the current state-of-the-art MARL algorithms on 10 of the 14 scenarios that comprise the challenging StarCraft II micromanagement benchmark and (2) achieves rapid transfer to new environments with three times the number of agents. Demonstrative videos are available at https://sites.google.com/view/rode-marl .


Off-Policy Multi-Agent Decomposed Policy Gradients

arXiv.org Machine Learning

Multi-agent policy gradient (MAPG) methods recently witness vigorous progress. However, there is a significant performance discrepancy between MAPG methods and state-of-the-art multi-agent value-based approaches. In this paper, we investigate causes that hinder the performance of MAPG algorithms and present a multi-agent decomposed policy gradient method (DOP). This method introduces the idea of value function decomposition into the multi-agent actor-critic framework. Based on this idea, DOP supports efficient off-policy learning and addresses the issue of centralized-decentralized mismatch and credit assignment in both discrete and continuous action spaces. We formally show that DOP critics have sufficient representational capability to guarantee convergence. In addition, empirical evaluations on the StarCraft II micromanagement benchmark and multi-agent particle environments demonstrate that DOP significantly outperforms both state-of-the-art value-based and policy-based multi-agent reinforcement learning algorithms. Demonstrative videos are available at https://sites.google.com/view/dop-mapg/.


Explainability via Responsibility

arXiv.org Artificial Intelligence

Procedural Content Generation via Machine Learning (PCGML) refers to a group of methods for creating game content (e.g. platformer levels, game maps, etc.) using machine learning models. PCGML approaches rely on black box models, which can be difficult to understand and debug by human designers who do not have expert knowledge about machine learning. This can be even more tricky in co-creative systems where human designers must interact with AI agents to generate game content. In this paper we present an approach to explainable artificial intelligence in which certain training instances are offered to human users as an explanation for the AI agent's actions during a co-creation process. We evaluate this approach by approximating its ability to provide human users with the explanations of AI agent's actions and helping them to more efficiently cooperate with the AI agent.


A Sharp Analysis of Model-based Reinforcement Learning with Self-Play

arXiv.org Artificial Intelligence

Model-based algorithms---algorithms that decouple learning of the model and planning given the model---are widely used in reinforcement learning practice and theoretically shown to achieve optimal sample efficiency for single-agent reinforcement learning in Markov Decision Processes (MDPs). However, for multi-agent reinforcement learning in Markov games, the current best known sample complexity for model-based algorithms is rather suboptimal and compares unfavorably against recent model-free approaches. In this paper, we present a sharp analysis of model-based self-play algorithms for multi-agent Markov games. We design an algorithm \emph{Optimistic Nash Value Iteration} (Nash-VI) for two-player zero-sum Markov games that is able to output an $\epsilon$-approximate Nash policy in $\tilde{\mathcal{O}}(H^3SAB/\epsilon^2)$ episodes of game playing, where $S$ is the number of states, $A,B$ are the number of actions for the two players respectively, and $H$ is the horizon length. This is the first algorithm that matches the information-theoretic lower bound $\Omega(H^3S(A+B)/\epsilon^2)$ except for a $\min\{A,B\}$ factor, and compares favorably against the best known model-free algorithm if $\min\{A,B\}=o(H^3)$. In addition, our Nash-VI outputs a single Markov policy with optimality guarantee, while existing sample-efficient model-free algorithms output a nested mixture of Markov policies that is in general non-Markov and rather inconvenient to store and execute. We further adapt our analysis to designing a provably efficient task-agnostic algorithm for zero-sum Markov games, and designing the first line of provably sample-efficient algorithms for multi-player general-sum Markov games.


EECBS: A Bounded-Suboptimal Search for Multi-Agent Path Finding

arXiv.org Artificial Intelligence

Multi-Agent Path Finding (MAPF), i.e., finding collision-free paths for multiple robots, is important for many applications where small runtimes are important, including the kind of automated warehouses operated by Amazon. CBS is a leading two-level search algorithm for solving MAPF optimally. ECBS is a bounded-suboptimal variant of CBS that uses focal search to speed up CBS by sacrificing optimality and instead guaranteeing that the costs of its solution are within a given factor of optimal. In this paper, we study how to decrease its runtime even further using inadmissible heuristics. Motivated by Explicit Estimation Search (EES), we propose Explicit Estimation CBS (EECBS), a new bounded-suboptimal variant of CBS, that uses online learning to inadmissibly estimate the cost of the solution under each high-level node and uses EES to choose which high-level node to expand next. We also investigate recent improvements to CBS and adapt them to EECBS. We find that EECBS with the improvements runs significantly faster than the MAPF algorithms ECBS, BCP-7, and eMDD-SAT on a variety of MAPF instances. We hope that the scalability of EECBS enables wider adoption of MAPF formulations in practical applications.


The Interplay Between Artificial Intelligence and Uncertainty

#artificialintelligence

In this first article, we highlight how intelligence and rationality are tightly coupled with the uncertainty present in the world. We also discuss how uncertainty plays a critical role in designing beneficial general-purpose artificial intelligence (AI), as described by the work of Stuart Russel and Peter Norvig on Modern AI [1][2]. Human intelligence, both social and individual, is what has been driving advances achieved by the human civilization. Having access to even greater intelligence in the form of machine artificial intelligence (AI) can potentially lead to even further advances, and will help us solve major problems such as eliminating poverty and disease, solving open scientific and mathematical problems, and offering personal assistance targeting billions of people worldwide. This is subject of course to the finite resources of land and raw material available on earth.


Interactive Reinforcement Learning for Feature Selection with Decision Tree in the Loop

arXiv.org Machine Learning

We study the problem of balancing effectiveness and efficiency in automated feature selection. After exploring many feature selection methods, we observe a computational dilemma: 1) traditional feature selection is mostly efficient, but difficult to identify the best subset; 2) the emerging reinforced feature selection automatically navigates to the best subset, but is usually inefficient. Can we bridge the gap between effectiveness and efficiency under automation? Motivated by this dilemma, we aim to develop a novel feature space navigation method. In our preliminary work, we leveraged interactive reinforcement learning to accelerate feature selection by external trainer-agent interaction. In this journal version, we propose a novel interactive and closed-loop architecture to simultaneously model interactive reinforcement learning (IRL) and decision tree feedback (DTF). Specifically, IRL is to create an interactive feature selection loop and DTF is to feed structured feature knowledge back to the loop. First, the tree-structured feature hierarchy from decision tree is leveraged to improve state representation. In particular, we represent the selected feature subset as an undirected graph of feature-feature correlations and a directed tree of decision features. We propose a new embedding method capable of empowering graph convolutional network to jointly learn state representation from both the graph and the tree. Second, the tree-structured feature hierarchy is exploited to develop a new reward scheme. In particular, we personalize reward assignment of agents based on decision tree feature importance. In addition, observing agents' actions can be feedback, we devise another reward scheme, to weigh and assign reward based on the feature selected frequency ratio in historical action records. Finally, we present extensive experiments on real-world datasets to show the improved performance.