 Agents


Meta-control of social learning strategies

arXiv.org Artificial Intelligence

Social learning, copying others' behavior without actual experience, offers a cost-effective means of knowledge acquisition. However, it raises the fundamental question of which individuals have reliable information: successful individuals or the majority. Copying the former is known as the success-based social learning strategy, and copying the latter as the conformist strategy. We show here that while the success-based strategy fully exploits benign environments of low uncertainty, it fails in uncertain environments. The conformist strategy, on the other hand, can effectively mitigate this adverse effect. Based on these findings, we hypothesized that meta-control of individual and social learning strategies provides effective and sample-efficient learning in volatile and uncertain environments. Simulations on a set of environments with various levels of volatility and uncertainty confirmed our hypothesis. The results imply that meta-control of social learning affords agents the leverage to resolve environmental uncertainty with minimal exploration cost, by exploiting others' learning as an external knowledge base.
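The two strategies named in this abstract can be illustrated with a minimal Python sketch. It is not the paper's model: the function names, the toy demonstrator data, and the assumption that each demonstrator exposes one action and one observed reward are all hypothetical, chosen only to show how the two copying rules can disagree.

```python
from collections import Counter

def success_based_choice(actions, rewards):
    """Copy the action of the demonstrator with the highest observed reward."""
    best = max(range(len(actions)), key=lambda i: rewards[i])
    return actions[best]

def conformist_choice(actions):
    """Copy the action chosen by the largest number of demonstrators."""
    return Counter(actions).most_common(1)[0][0]

# Hypothetical observations: five demonstrators, one action and one reward each.
actions = ["A", "B", "B", "B", "A"]
rewards = [1.0, 0.2, 0.4, 0.3, 0.1]

success_pick = success_based_choice(actions, rewards)  # follows the single top earner
conformist_pick = conformist_choice(actions)           # follows the majority
print(success_pick, conformist_pick)
```

In this toy data a single high reward drives the success-based pick, while the conformist pick depends only on how many demonstrators chose each action. If rewards are noisy, that single high reward may be a lucky draw, which is the kind of situation the abstract describes as unfavorable to the success-based strategy.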


Many Agent Reinforcement Learning Under Partial Observability

arXiv.org Artificial Intelligence

Recent renewed interest in multi-agent reinforcement learning (MARL) has generated an impressive array of techniques that leverage deep reinforcement learning, primarily actor-critic architectures, and can be applied to a limited range of settings in terms of observability and communication. However, a continuing limitation of much of this work is the curse of dimensionality when it comes to representations based on joint actions, which grow exponentially with the number of agents. In this paper, we squarely focus on this challenge of scalability. We apply the key insight of action anonymity, which leads to permutation invariance of joint actions, to two recently presented deep MARL algorithms, MADDPG and IA2C, and compare these instantiations to another recent technique that leverages action anonymity, viz., mean-field MARL. We show that our instantiations can learn the optimal behavior in a broader class of agent networks than the mean-field method, using a recently introduced pragmatic domain.
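The scalability argument can be made concrete with a small Python sketch. It assumes action anonymity in its simplest form, where only the number of agents taking each action matters; the helper name `action_configuration` and the three-action space are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import product

def action_configuration(joint_action, action_space):
    """Map a joint action to per-action counts, discarding agent identities."""
    counts = Counter(joint_action)
    return tuple(counts[a] for a in action_space)

action_space = ("left", "right", "stay")  # hypothetical action set

# Two joint actions that differ only in which agent does what
# collapse to the same configuration (permutation invariance).
c1 = action_configuration(("left", "right", "left"), action_space)
c2 = action_configuration(("left", "left", "right"), action_space)

# Compare representation sizes for a small population.
n_agents = 4
num_joint_actions = len(action_space) ** n_agents
configurations = {
    action_configuration(joint, action_space)
    for joint in product(action_space, repeat=n_agents)
}
print(c1, c2, num_joint_actions, len(configurations))
```

The number of joint actions grows exponentially with the number of agents, while the number of count vectors grows only polynomially for a fixed action set. That gap is the reason a permutation-invariant representation helps with the curse of dimensionality described above.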


Optimizing robotic swarm based construction tasks

arXiv.org Artificial Intelligence

Social insects such as ants, termites and bees construct their colonies collaboratively and very efficiently. In these swarms, each insect contributes to the construction task individually, and the individual entities behave redundantly and in parallel. However, robotic adaptations of these swarm behaviors have not yet been deployed in the real world at a scale large enough to be in common use, owing to limitations in existing approaches to swarm robotic construction. This paper presents an approach that combines existing swarm construction approaches, resulting in a swarm robotic system capable of constructing a given two-dimensional shape in an optimized manner.


Hi-Phy: A Benchmark for Hierarchical Physical Reasoning

arXiv.org Artificial Intelligence

Reasoning about the behaviour of physical objects is a key capability of agents operating in physical worlds. Humans are very experienced in physical reasoning while it remains a major challenge for AI. To facilitate research addressing this problem, several benchmarks have been proposed recently. However, these benchmarks do not enable us to measure an agent's granular physical reasoning capabilities when solving a complex reasoning task. In this paper, we propose a new benchmark for physical reasoning that allows us to test individual physical reasoning capabilities. Inspired by how humans acquire these capabilities, we propose a general hierarchy of physical reasoning capabilities with increasing complexity. Our benchmark tests capabilities according to this hierarchy through generated physical reasoning tasks in the video game Angry Birds. This benchmark enables us to conduct a comprehensive agent evaluation by measuring the agent's granular physical reasoning capabilities. We conduct an evaluation with human players, learning agents, and heuristic agents and determine their capabilities. Our evaluation shows that learning agents, with good local generalization ability, still struggle to learn the underlying physical reasoning capabilities and perform worse than current state-of-the-art heuristic agents and humans. We believe that this benchmark will encourage researchers to develop intelligent agents with advanced, human-like physical reasoning capabilities. URL: https://github.com/Cheng-Xue/Hi-Phy


Multi-Agent Training beyond Zero-Sum with Correlated Equilibrium Meta-Solvers

arXiv.org Artificial Intelligence

Two-player, constant-sum games are well studied in the literature, but there has been limited progress outside of this setting. We propose Joint Policy-Space Response Oracles (JPSRO), an algorithm for training agents in n-player, general-sum extensive form games, which provably converges to an equilibrium. We further suggest correlated equilibria (CE) as promising meta-solvers, and [...]

Outside of normal form (NF) games, this problem setting arises in multi-agent training when dealing with empirical games (also called meta-games), where a game payoff tensor is populated with expected outcomes between agents playing an extensive form (EF) game, for example the StarCraft League (Vinyals et al., 2019) and Policy-Space Response Oracles (PSRO) (Lanctot et al., 2017), a recent variant of which reached state-of-the-art results in Stratego Barrage (McAleer et al., 2020).


Cooperative Multi-Agent Reinforcement Learning Based Distributed Dynamic Spectrum Access in Cognitive Radio Networks

arXiv.org Artificial Intelligence

This work has been submitted to the IEEE for possible publication.

Abstract: With the development of 5G and the Internet of Things, large numbers of wireless devices need to share limited spectrum resources. Dynamic spectrum access (DSA) is a promising paradigm to remedy the inefficient spectrum utilization brought about by the historical command-and-control approach to spectrum allocation. In this paper, we investigate the distributed DSA problem for multiple users in a typical multi-channel cognitive radio network. The problem is formulated as a decentralized partially observable Markov decision process (Dec-POMDP), and we propose a centralized off-line training and distributed on-line execution framework based on cooperative multi-agent reinforcement learning (MARL). We employ the deep recurrent Q-network (DRQN) to address the partial observability of the state for each cognitive user. The ultimate goal is to learn a cooperative strategy that maximizes the sum throughput of the cognitive radio network in a distributed fashion, without exchanging coordination information between cognitive users. From the simulation results, we observe that the proposed algorithm converges fast and achieves almost optimal performance.

This work was supported in part by the National Natural Science Foundation of China under Grant 6193000305. X. Tan, L. Zhou, Y. Sun, H. Wang, H. Zhao and J. Wei are all with College of Electronic Science and Technology, National University of Defense Technology, Changsha, 410073, China (E-mail: {tanxiang, zhouli2035, haijunwang14, sunyuli19, haitaozhao, wjbhw}@nudt.edu.cn). Boon-Chong Seet is with the Department of Electrical and Electronic Engineering, Auckland University of Technology, Auckland 1142, New Zealand (E-mail: boon-chong.seet@aut.ac.nz). Victor C. M. Leung is with Shenzhen University, Shenzhen, China and the University of British Columbia, Vancouver, Canada (E-mail: vleung@ieee.org).

The future network is evolving into the Internet of Everything.


Seeing Differently, Acting Similarly: Imitation Learning with Heterogeneous Observations

arXiv.org Artificial Intelligence

In many real-world imitation learning tasks, the demonstrator and the learner have to act in different but full observation spaces. This situation creates significant obstacles for existing imitation learning approaches, even when they are combined with traditional space adaptation techniques. The main challenge lies in bridging the expert's occupancy measures and the learner's dynamically changing occupancy measures under the different observation spaces. In this work, we model the above learning problem as Heterogeneous Observations Imitation Learning (HOIL). We propose the Importance Weighting with REjection (IWRE) algorithm, based on the techniques of importance weighting, learning with rejection, and active querying, to solve the key challenge of occupancy measure matching. Experimental results show that IWRE can successfully solve HOIL tasks, including the challenging task of transforming vision-based demonstrations into random access memory (RAM)-based policies in the Atari domain.


A discrete optimisation approach for target path planning whilst evading sensors

arXiv.org Artificial Intelligence

In this paper we deal with a practical problem that arises in military situations. The problem is to plan a path for one (or more) agents to reach a target without being detected by enemy sensors. Agents are not passive, rather they can (within limits) initiate actions which aid evasion, namely knockout (completely disable sensors) and confusion (reduce sensor detection probabilities). Agent actions are path dependent and time limited. Here by path dependent we mean that an agent needs to be sufficiently close to a sensor to knock it out. By time limited we mean that a limit is imposed on how long a sensor is knocked out or confused before it reverts back to its original operating state. The approach adopted breaks the continuous space in which agents move into a discrete space. This enables the problem to be represented (formulated) mathematically as a zero-one integer program with linear constraints. The advantage of representing the problem in this manner is that powerful commercial software optimisation packages exist to solve the problem to proven global optimality. Computational results are presented for a number of randomly generated test problems.


Targeted Data Acquisition for Evolving Negotiation Agents

arXiv.org Artificial Intelligence

Successful negotiators must learn how to balance optimizing for self-interest and cooperation. Yet current artificial negotiation agents often heavily depend on the quality of the static datasets they were trained on, limiting their capacity to fashion an adaptive response balancing self-interest and cooperation. For this reason, we find that these agents can achieve either high utility or cooperation, but not both. To address this, we introduce a targeted data acquisition framework where we guide the exploration of a reinforcement learning agent using annotations from an expert oracle.

Consider a standard non-cooperative negotiation game (Deming et al., 1944; Nash, 1950; 1951) as shown in Figure 1 where two agents, Alice and Bob, are trying to agree on an allocation of shared resources. Both have high utility associated with the hats and balls, though Alice also cares about books. Effectively employing negotiation is crucial, and is the only way to reach an equitable outcome: dividing the hats and balls evenly, while giving Alice the book. Even where negotiating agents have incentives that make it challenging for them to cooperate, it would be difficult to imagine that negotiation could be useful to agents over time, let alone society, if agents were incapable of cooperating to achieve equitable [...]


Learned Belief Search: Efficiently Improving Policies in Partially Observable Settings

arXiv.org Artificial Intelligence

Search is an important tool for computing effective policies in single- and multi-agent environments, and has been crucial for achieving superhuman performance in several benchmark fully and partially observable games. However, one major limitation of prior search approaches for partially observable environments is that the computational cost scales poorly with the amount of hidden information. In this paper we present Learned Belief Search (LBS), a computationally efficient search procedure for partially observable environments. Rather than maintaining an exact belief distribution, LBS uses an approximate auto-regressive counterfactual belief that is learned as a supervised task. In multi-agent settings, LBS uses a novel public-private model architecture for underlying policies in order to efficiently evaluate these policies during rollouts. In the benchmark domain of Hanabi, LBS can obtain 55% to 91% of the benefit of exact search while reducing compute requirements by 35.8× to 4.6×, allowing it to scale to larger settings that were inaccessible to previous search methods.