Goto

Collaborating Authors

 Agents


Human-level performance in 3D multiplayer games with population-based reinforcement learning

#artificialintelligence

Artificially intelligent agents are getting better and better at two-player games, but most real-world endeavors require teamwork. Jaderberg et al. designed a computer program that excels at playing the video game Quake III Arena in Capture the Flag mode, where two multiplayer teams compete in capturing the flags of the opposing team. The agents were trained by playing thousands of games, gradually learning successful strategies not unlike those favored by their human counterparts. Computer agents competed successfully against humans even when their reaction times were slowed to match those of humans. Reinforcement learning (RL) has shown great success in increasingly complex single-agent environments and two-player turn-based games. However, the real world contains multiple agents, each learning and acting independently to cooperate and compete with other agents.


Control-guided Communication: Efficient Resource Arbitration and Allocation in Multi-hop Wireless Control Systems

arXiv.org Artificial Intelligence

In future autonomous systems, wireless multi-hop communication is key to enable collaboration among distributed agents at low cost and high flexibility. When many agents need to transmit information over the same wireless network, communication becomes a shared and contested resource. Event-triggered and self-triggered control account for this by transmitting data only when needed, enabling significant energy savings. However, a solution that brings those benefits to multi-hop networks and can reallocate freed up bandwidth to additional agents or data sources is still missing. To fill this gap, we propose control-guided communication, a novel co-design approach for distributed self-triggered control over wireless multi-hop networks. The control system informs the communication system of its transmission demands ahead of time, and the communication system allocates resources accordingly. Experiments on a cyber-physical testbed show that multiple cart-poles can be synchronized over wireless, while serving other traffic when resources are available, or saving energy. These experiments are the first to demonstrate and evaluate distributed self-triggered control over low-power multi-hop wireless networks at update rates of tens of milliseconds.


Arena: A General Evaluation Platform and Building Toolkit for Multi-Agent Intelligence

arXiv.org Artificial Intelligence

Learning agents that are not only capable of taking tests but are also innovating are becoming a hot topic in artificial intelligence (AI). One of the most promising paths towards this vision is multi-agent learning, where agents act as the environment for each other, and improving each agent means proposing new problems for others. However, the existing evaluation platforms are either not compatible with multi-agent settings, or limited to a specific game. That is, there is not yet a general evaluation platform for research on multi-agent intelligence. To this end, we introduce Arena, a general evaluation platform for multi-agent intelligence with 35 games of diverse logic and representations. Furthermore, multi-agent intelligence is still at the stage where many problems remain unexplored. Therefore, we provide a building toolkit for researchers to easily invent and build novel multi-agent problems from the provided games set based on a GUI-configurable social tree and five basic multi-agent reward schemes. Finally, we provide python implementations of five state-of-the-art deep multi-agent reinforcement learning baselines. Along with the baseline implementations, we release a set of 100 best agents/teams that we can train with different training schemes for each game, as the base for evaluating agents with population performance. As such, the research community can perform comparisons under a stable and uniform standard. Code for the games, building toolkit and baselines are released at https://github.com/YuhangSong/Arena-BuildingToolkit and https://github.com/YuhangSong/Arena-Baselines.


Optimal Off-Policy Evaluation for Reinforcement Learning with Marginalized Importance Sampling

arXiv.org Artificial Intelligence

Motivated by the many real-world applications of reinforcement learning (RL) that require safe-policy iterations, we consider the problem of off-policy evaluation (OPE) --- the problem of evaluating a new policy using the historical data obtained by different behavior policies --- under the model of nonstationary episodic Markov Decision Processes with a long horizon and large action space. Existing importance sampling (IS) methods often suffer from large variance that depends exponentially on the RL horizon $H$. To solve this problem, we consider a marginalized importance sampling (MIS) estimator that recursively estimates the state marginal distribution for the target policy at every step. MIS achieves a mean-squared error of $O(H^2R_{\max}^2\sum_{t=1}^H\mathbb E_\mu[(w_{\pi,\mu}(s_t,a_t))^2]/n)$ for large $n$, where $w_{\pi,\mu}(s_t,a_t)$ is the ratio of the marginal distribution of $t$th step under $\pi$ and $\mu$, $H$ is the horizon, $R_{\max}$ is the maximal rewards, and $n$ is the sample size. The result nearly matches the Cramer-Rao lower bounds for DAG MDP in \citet{jiang2016doubly} for most non-trivial regimes. To the best of our knowledge, this is the first OPE estimator with provably optimal dependence in $H$ and the second moments of the importance weight. Besides theoretical optimality, we empirically demonstrate the superiority of our method in time-varying, partially observable, and long-horizon RL environments.


Options as responses: Grounding behavioural hierarchies in multi-agent RL

arXiv.org Artificial Intelligence

We propose a novel hierarchical agent architecture for multi-agent reinforcement learning with concealed information. The hierarchy is grounded in the concealed information about other players, which resolves "the chicken or the egg" nature of option discovery. We factorise the value function over a latent representation of the concealed information and then re-use this latent space to factorise the policy into options. Low-level policies (options) are trained to respond to particular states of other agents grouped by the latent representation, while the top level (meta-policy) learns to infer the latent representation from its own observation thereby to select the right option. This grounding facilitates credit assignment across the levels of hierarchy. We show that this helps generalisation---performance against a held-out set of pre-trained competitors, while training in self- or population-play---and resolution of social dilemmas in self-play.


Escaping the State of Nature: A Hobbesian Approach to Cooperation in Multi-agent Reinforcement Learning

arXiv.org Artificial Intelligence

Cooperation is a phenomenon that has been widely studied across many different disciplines. In the field of computer science, the modularity and robustness of multi-agent systems offer significant practical advantages over individual machines. At the same time, agents using standard reinforcement learning algorithms often fail to achieve long-term, cooperative strategies in unstable environments when there are short-term incentives to defect. Political philosophy, on the other hand, studies the evolution of cooperation in humans who face similar incentives to act individualistically, but nevertheless succeed in forming societies. Thomas Hobbes in Leviathan provides the classic analysis of the transition from a pre-social State of Nature, where consistent defection results in a constant state of war, to stable political community through the institution of an absolute Sovereign. This thesis argues that Hobbes's natural and moral philosophy are strikingly applicable to artificially intelligent agents and aims to show that his political solutions are experimentally successful in producing cooperation among modified Q-Learning agents. Cooperative play is achieved in a novel Sequential Social Dilemma called the Civilization Game, which models the State of Nature by introducing the Hobbesian mechanisms of opponent learning awareness and majoritarian voting, leading to the establishment of a Sovereign.


Exploration with Unreliable Intrinsic Reward in Multi-Agent Reinforcement Learning

arXiv.org Artificial Intelligence

This paper investigates the use of intrinsic reward to guide exploration in multi-agent reinforcement learning. We discuss the challenges in applying intrinsic reward to multiple collaborative agents and demonstrate how unreliable reward can prevent decentralized agents from learning the optimal policy. We address this problem with a novel framework, Independent Centrally-assisted Q-learning (ICQL), in which decentralized agents share control and an experience replay buffer with a centralized agent. Only the centralized agent is intrinsically rewarded, but the decentralized agents still benefit from improved exploration, without the distraction of unreliable incentives.


Ease-of-Teaching and Language Structure from Emergent Communication

arXiv.org Artificial Intelligence

Artificial agents have been shown to learn to communicate when needed to complete a cooperative task. Some level of language structure (e.g., compositionality) has been found in the learned communication protocols. This observed structure is often the result of specific environmental pressures during training. By introducing new agents periodically to replace old ones, sequentially and within a population, we explore such a new pressure -- ease of teaching -- and show its impact on the structure of the resulting language.


Finding Friend and Foe in Multi-Agent Games

arXiv.org Machine Learning

Recent breakthroughs in AI for multi-agent games like Go, Poker, and Dota, have seen great strides in recent years. Yet none of these games address the real-life challenge of cooperation in the presence of unknown and uncertain teammates. This challenge is a key game mechanism in hidden role games. Here we develop the DeepRole algorithm, a multi-agent reinforcement learning agent that we test on The Resistance: Avalon, the most popular hidden role game. DeepRole combines counterfactual regret minimization (CFR) with deep value networks trained through self-play. Our algorithm integrates deductive reasoning into vector-form CFR to reason about joint beliefs and deduce partially observable actions. We augment deep value networks with constraints that yield interpretable representations of win probabilities. These innovations enable DeepRole to scale to the full Avalon game. Empirical game-theoretic methods show that DeepRole outperforms other hand-crafted and learned agents in five-player Avalon. DeepRole played with and against human players on the web in hybrid human-agent teams. We find that DeepRole outperforms human players as both a cooperator and a competitor.


Architectural Middleware that Supports Building High-performance, Scalable, Ubiquitous, Intelligent Personal Assistants

arXiv.org Artificial Intelligence

Intelligent Personal Assistants (IPAs) are software agents that can perform tasks on behalf of individuals and assist them on many of their daily activities. IPAs capabilities are expanding rapidly due to the recent advances on areas such as natural language processing, machine learning, artificial cognition, and ubiquitous computing, which equip the agents with competences to understand what users say, collect information from everyday ubiquitous devices (e.g., smartphones, wearables, tablets, laptops, cars, household appliances, etc.), learn user preferences, deliver data-driven search results, and make decisions based on user's context. Apart from the inherent complexity of building such IPAs, developers and researchers have to address many critical architectural challenges (e.g., low-latency, scalability, concurrency, ubiquity, code mobility, interoperability, support to cognitive services and reasoning, to name a few.), thereby diverting them from their main goal: building IPAs. Thus, our contribution in this paper is twofold: 1) we propose an architecture for a platform-agnostic, high-performance, ubiquitous, and distributed middleware that alleviates the burdensome task of dealing with low-level implementation details when building IPAs by adding multiple abstraction layers that hide the underlying complexity; and 2) we present an implementation of the middleware that concretizes the aforementioned architecture and allows the development of high-level capabilities while scaling the system up to hundreds of thousands of IPAs with no extra effort. We demonstrate the powerfulness of our middleware by analyzing software metrics for complexity, effort, performance, cohesion and coupling when developing a conversational IPA.