Goto

Collaborating Authors

 Agent Societies


Toward Risk-based Optimistic Exploration for Cooperative Multi-Agent Reinforcement Learning

arXiv.org Artificial Intelligence

The multi-agent setting is intricate and unpredictable since the behaviors of multiple agents influence one another. To address this environmental uncertainty, distributional reinforcement learning algorithms that incorporate uncertainty via distributional output have been integrated with multi-agent reinforcement learning (MARL) methods, achieving state-of-the-art performance. However, distributional MARL algorithms still rely on the traditional $\epsilon$-greedy, which does not take cooperative strategy into account. In this paper, we present a risk-based exploration that leads to collaboratively optimistic behavior by shifting the sampling region of distribution. Initially, we take expectations from the upper quantiles of state-action values for exploration, which are optimistic actions, and gradually shift the sampling region of quantiles to the full distribution for exploitation. By ensuring that each agent is exposed to the same level of risk, we can force them to take cooperatively optimistic actions. Our method shows remarkable performance in multi-agent settings requiring cooperative exploration based on quantile regression appropriately controlling the level of risk.


GHQ: Grouped Hybrid Q Learning for Heterogeneous Cooperative Multi-agent Reinforcement Learning

arXiv.org Artificial Intelligence

Previous deep multi-agent reinforcement learning (MARL) algorithms have achieved impressive results, typically in homogeneous scenarios. However, heterogeneous scenarios are also very common and usually harder to solve. In this paper, we mainly discuss cooperative heterogeneous MARL problems in Starcraft Multi-Agent Challenges (SMAC) environment. We firstly define and describe the heterogeneous problems in SMAC. In order to comprehensively reveal and study the problem, we make new maps added to the original SMAC maps. We find that baseline algorithms fail to perform well in those heterogeneous maps. To address this issue, we propose the Grouped Individual-Global-Max Consistency (GIGM) and a novel MARL algorithm, Grouped Hybrid Q Learning (GHQ). GHQ separates agents into several groups and keeps individual parameters for each group, along with a novel hybrid structure for factorization. To enhance coordination between groups, we maximize the Inter-group Mutual Information (IGMI) between groups' trajectories. Experiments on original and new heterogeneous maps show the fabulous performance of GHQ compared to other state-of-the-art algorithms.


Multiagent Inverse Reinforcement Learning via Theory of Mind Reasoning

arXiv.org Artificial Intelligence

We approach the problem of understanding how people interact with each other in collaborative settings, especially when individuals know little about their teammates, via Multiagent Inverse Reinforcement Learning (MIRL), where the goal is to infer the reward functions guiding the behavior of each individual given trajectories of a team's behavior during some task. Unlike current MIRL approaches, we do not assume that team members know each other's goals a priori; rather, that they collaborate by adapting to the goals of others perceived by observing their behavior, all while jointly performing a task. To address this problem, we propose a novel approach to MIRL via Theory of Mind (MIRL-ToM). For each agent, we first use ToM reasoning to estimate a posterior distribution over baseline reward profiles given their demonstrated behavior. We then perform MIRL via decentralized equilibrium by employing single-agent Maximum Entropy IRL to infer a reward function for each agent, where we simulate the behavior of other teammates according to the time-varying distribution over profiles. We evaluate our approach in a simulated 2-player search-and-rescue operation where the goal of the agents, playing different roles, is to search for and evacuate victims in the environment. Our results show that the choice of baseline profiles is paramount to the recovery of the ground-truth rewards, and that MIRL-ToM is able to recover the rewards used by agents interacting both with known and unknown teammates.


Parameter Sharing with Network Pruning for Scalable Multi-Agent Deep Reinforcement Learning

arXiv.org Artificial Intelligence

Handling the problem of scalability is one of the essential issues for multi-agent reinforcement learning (MARL) algorithms to be applied to real-world problems typically involving massively many agents. For this, parameter sharing across multiple agents has widely been used since it reduces the training time by decreasing the number of parameters and increasing the sample efficiency. However, using the same parameters across agents limits the representational capacity of the joint policy and consequently, the performance can be degraded in multi-agent tasks that require different behaviors for different agents. In this paper, we propose a simple method that adopts structured pruning for a deep neural network to increase the representational capacity of the joint policy without introducing additional parameters. We evaluate the proposed method on several benchmark tasks, and numerical results show that the proposed method significantly outperforms other parameter-sharing methods.


From Smart Sensing to Consciousness: An info-structural model of computational consciousness for non-interacting agents

arXiv.org Artificial Intelligence

This study proposes a model of computational consciousness for non-interacting agents. The phenomenon of interest was assumed as sequentially dependent on the cognitive tasks of sensation, perception, emotion, affection, attention, awareness, and consciousness. Starting from the Smart Sensing prodromal study, the cognitive layers associated with the processes of attention, awareness, and consciousness were formally defined and tested together with the other processes concerning sensation, perception, emotion, and affection. The output of the model consists of an index that synthesizes the energetic and entropic contributions of consciousness from a computationally moral perspective. Attention was modeled through a bottom-up approach, while awareness and consciousness by distinguishing environment from subjective cognitive processes. By testing the solution on visual stimuli eliciting the emotions of happiness, anger, fear, surprise, contempt, sadness, disgust, and the neutral state, it was found that the proposed model is concordant with the scientific evidence concerning covert attention. Comparable results were also obtained regarding studies investigating awareness as a consequence of visual stimuli repetition, as well as those investigating moral judgments to visual stimuli eliciting disgust and sadness. The solution represents a novel approach for defining computational consciousness through artificial emotional activity and morality.


Ask and You Shall be Served: Representing and Solving Multi-agent Optimization Problems with Service Requesters and Providers

arXiv.org Artificial Intelligence

In scenarios with numerous emergencies that arise and require the assistance of various rescue units (e.g., medical, fire, \& police forces), the rescue units would ideally be allocated quickly and distributedly while aiming to minimize casualties. This is one of many examples of distributed settings with service providers (the rescue units) and service requesters (the emergencies) which we term \textit{service oriented settings}. Allocating the service providers in a distributed manner while aiming for a global optimum is hard to model, let alone achieve, using the existing Distributed Constraint Optimization Problem (DCOP) framework. Hence, the need for a novel approach and corresponding algorithms. We present the Service Oriented Multi-Agent Optimization Problem (SOMAOP), a new framework that overcomes the shortcomings of DCOP in service oriented settings. We evaluate the framework using various algorithms based on auctions and matching algorithms (e.g., Gale Shapely). We empirically show that algorithms based on repeated auctions converge to a high quality solution very fast, while repeated matching problems converge slower, but produce higher quality solutions. We demonstrate the advantages of our approach over standard incomplete DCOP algorithms and a greedy centralized algorithm.


Planning-Assisted Context-Sensitive Autonomous Shepherding of Dispersed Robotic Swarms in Obstacle-Cluttered Environments

arXiv.org Artificial Intelligence

Robotic shepherding is a bio-inspired approach to autonomously guiding a swarm of agents towards a desired location. The research area has earned increasing research interest recently due to the efficacy of controlling a large number of agents in a swarm (sheep) using a smaller number of actuators (sheepdogs). However, shepherding a highly dispersed swarm in an obstacle-cluttered environment remains challenging for existing methods. To improve the efficacy of shepherding in complex environments with obstacles and dispersed sheep, this paper proposes a planning-assisted context-sensitive autonomous shepherding framework with collision avoidance abilities. The proposed approach models the swarm shepherding problem as a single Travelling Salesperson Problem (TSP), with two sheepdogs\textquoteright\ modes: no-interaction and interaction. An adaptive switching approach is integrated into the framework to guide real-time path planning for avoiding collisions with static and dynamic obstacles; the latter representing moving sheep swarms. We then propose an overarching hierarchical mission planning system, which is made of three sub-systems: a clustering approach to group and distinguish sheep sub-swarms, an Ant Colony Optimisation algorithm as a TSP solver for determining the optimal herding sequence of the sub-swarms, and an online path planner for calculating optimal paths for both sheepdogs and sheep. The experiments on various environments, both with and without obstacles, objectively demonstrate the effectiveness of the proposed shepherding framework and planning approaches.


PACCART: Reinforcing Trust in Multiuser Privacy Agreement Systems

arXiv.org Artificial Intelligence

Collaborative systems, such as Online Social Networks and the Internet of Things, enable users to share privacy sensitive content. Content in these systems is often co-owned by multiple users with different privacy expectations, leading to possible multiuser privacy conflicts. In order to resolve these conflicts, various agreement mechanisms have been designed and agents that could participate in such mechanisms have been proposed. However, research shows that users hesitate to use software tools for managing their privacy. To remedy this, we argue that users should be supported by trustworthy agents that adhere to the following criteria: (i) concealment of privacy preferences, such that only necessary information is shared with others, (ii) equity of treatment, such that different kinds of users are supported equally, (iii) collaboration of users, such that a group of users can support each other in agreement and (iv) explainability of actions, such that users know why certain information about them was shared to reach a decision. Accordingly, this paper proposes PACCART, an open-source agent that satisfies these criteria. Our experiments over simulations and user study indicate that PACCART increases user trust significantly.


Temporal Difference Learning with Compressed Updates: Error-Feedback meets Reinforcement Learning

arXiv.org Artificial Intelligence

In large-scale machine learning, recent works have studied the effects of compressing gradients in stochastic optimization in order to alleviate the communication bottleneck. These works have collectively revealed that stochastic gradient descent (SGD) is robust to structured perturbations such as quantization, sparsification, and delays. Perhaps surprisingly, despite the surge of interest in large-scale, multi-agent reinforcement learning, almost nothing is known about the analogous question: Are common reinforcement learning (RL) algorithms also robust to similar perturbations? In this paper, we investigate this question by studying a variant of the classical temporal difference (TD) learning algorithm with a perturbed update direction, where a general compression operator is used to model the perturbation. Our main technical contribution is to show that compressed TD algorithms, coupled with an error-feedback mechanism used widely in optimization, exhibit the same non-asymptotic theoretical guarantees as their SGD counterparts. We then extend our results significantly to nonlinear stochastic approximation algorithms and multi-agent settings. In particular, we prove that for multi-agent TD learning, one can achieve linear convergence speedups in the number of agents while communicating just $\tilde{O}(1)$ bits per agent at each time step. Our work is the first to provide finite-time results in RL that account for general compression operators and error-feedback in tandem with linear function approximation and Markovian sampling. Our analysis hinges on studying the drift of a novel Lyapunov function that captures the dynamics of a memory variable introduced by error feedback.


Order Matters: Agent-by-agent Policy Optimization

arXiv.org Artificial Intelligence

While multi-agent trust region algorithms have achieved great success empirically in solving coordination tasks, most of them, however, suffer from a non-stationarity problem since agents update their policies simultaneously. In contrast, a sequential scheme that updates policies agent-by-agent provides another perspective and shows strong performance. However, sample inefficiency and lack of monotonic improvement guarantees for each agent are still the two significant challenges for the sequential scheme. In this paper, we propose the \textbf{A}gent-by-\textbf{a}gent \textbf{P}olicy \textbf{O}ptimization (A2PO) algorithm to improve the sample efficiency and retain the guarantees of monotonic improvement for each agent during training. We justify the tightness of the monotonic improvement bound compared with other trust region algorithms. From the perspective of sequentially updating agents, we further consider the effect of agent updating order and extend the theory of non-stationarity into the sequential update scheme. To evaluate A2PO, we conduct a comprehensive empirical study on four benchmarks: StarCraftII, Multi-agent MuJoCo, Multi-agent Particle Environment, and Google Research Football full game scenarios. A2PO consistently outperforms strong baselines.