Agents
On Multi-Agent Learning in Team Sports Games
Zhao, Yunqi, Borovikov, Igor, Rupert, Jason, Somers, Caedmon, Beirami, Ahmad
In recent years, reinforcement learning has been successful in solving video games from Atari to StarCraft II. However, end-to-end model-free reinforcement learning (RL) is not sample efficient and requires a significant amount of computational resources to achieve superhuman performance. Model-free RL is also unlikely to produce human-like agents for playtesting and gameplaying AI in the development cycle of complex video games. In this paper, we present a hierarchical approach to training agents with the goal of achieving human-like style and high skill level in team sports games. While this is still work in progress, our preliminary results show that the presented approach holds promise for solving the posed multi-agent learning problem.
Multi-Agent Deep Reinforcement Learning for Liquidation Strategy Analysis
Liquidation is the process of selling a large number of shares of one stock sequentially within a given time frame, taking into consideration the costs arising from market impact and a trader's risk aversion. The main challenge in optimizing liquidation is to find an appropriate modeling system that can incorporate the complexities of the stock market and generate practical trading strategies. In this paper, we propose to use a multi-agent deep reinforcement learning model, which captures high-level complexities better than various other machine learning methods, so that agents can learn how to make the best selling decisions. First, we theoretically analyze the Almgren and Chriss model and extend its fundamental mechanism so it can be used as the multi-agent trading environment. Our work builds the foundation for future multi-agent environment trading analysis. Second, we analyze the cooperative and competitive behaviours between agents by adjusting the reward functions for each agent, which overcomes the limitation of single-agent reinforcement learning algorithms. Finally, we simulate trading and develop an optimal trading strategy with practical constraints by using a reinforcement learning method, which shows the capabilities of reinforcement learning methods in solving realistic liquidation problems.
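As a rough illustration of the single-agent baseline that the abstract starts from, the sketch below computes the classical Almgren and Chriss holdings trajectory x_k = X sinh(kappa (T - t_k)) / sinh(kappa T). It is not the paper's multi-agent environment. The urgency parameter `kappa` is treated as a given input rather than derived from volatility, impact, and risk aversion, and the function name and example numbers are assumptions made for illustration only.

```python
import math


def liquidation_trajectory(total_shares, n_steps, horizon, kappa):
    """Remaining holdings x_0..x_N at equally spaced times t_k = k * horizon / n_steps.

    kappa (> 0) is assumed to be given; larger values sell faster early on.
    """
    if kappa <= 0:
        raise ValueError("kappa must be positive")
    tau = horizon / n_steps
    denom = math.sinh(kappa * horizon)
    return [
        total_shares * math.sinh(kappa * (horizon - k * tau)) / denom
        for k in range(n_steps + 1)
    ]


# Hypothetical example: sell 1,000,000 shares over 5 periods.
holdings = liquidation_trajectory(1_000_000, 5, 5.0, 0.6)
# Shares sold in each period (difference of consecutive holdings).
trades = [a - b for a, b in zip(holdings, holdings[1:])]
```

Under these assumptions the schedule is front-loaded: each period sells fewer shares than the one before, and the trades sum to the initial position.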
Modeling Multi-Vehicle Interaction Scenarios Using Gaussian Random Field
Guo, Yaohui, Kalidindi, Vinay Varma, Arief, Mansur, Wang, Wenshuo, Zhu, Jiacheng, Peng, Huei, Zhao, Ding
Autonomous vehicles (AV) are expected to navigate in complex traffic scenarios with multiple surrounding vehicles. The correlations between road users vary over time, and their degree could, in theory, be infinitely large, which poses a great challenge in modeling and predicting the driving environment. In this research, we propose a method to reproduce such high-dimensional scenarios in a finitely tractable form by defining a stochastic vector field model in multi-vehicle interactions. We then apply non-parametric Bayesian learning to extract the underlying motion patterns from a large quantity of naturalistic traffic data. We use a Gaussian process to model multi-vehicle motion, and a Dirichlet process to assign each observation to a specific scenario. We implement the proposed method on NGSim highway and intersection data sets, in which complex multi-vehicle interactions are prevalent. The results show that the proposed method is capable of capturing motion patterns from both settings without imposing heroic priors, and hence can be applied to a wide array of traffic situations. The proposed modeling can enable simulation platforms and other testing methods designed for AV evaluation to easily model and generate traffic scenarios emulating large-scale driving data.
House Markets and Single-Peaked Preferences: From Centralized to Decentralized Allocation Procedures
Beynier, Aurélie, Maudet, Nicolas, Rey, Simon, Shams, Parham
Recently, the problem of allocating one resource per agent with initial endowments (\emph{house markets}) has seen a renewed interest: indeed, while in the general domain Top Trading Cycle is known to be the only procedure guaranteeing Pareto-optimality, individual rationality, and strategy-proofness, the situation differs in single-peaked domains. Bade (2019) presented the Crawler, an alternative procedure enjoying the same properties (with the additional advantage of being implementable in obviously dominant strategies), while Damamme et al. (2015) showed that allowing mutually beneficial swap-deals among the agents was already enough to guarantee Pareto-optimality. In this paper we significantly deepen our understanding of these decentralized procedures: we show in particular that the single-peaked domains happen to be ``maximal'' if one wishes to guarantee this convergence property. Interestingly, we also observe that the set of allocations reachable by swap-deals always contains the outcome of the Crawler. To further investigate how these different mechanisms compare, we pay special attention to the average and minimum rank of the resource obtained by the agents in the outcome allocation. We provide theoretical bounds on the loss potentially induced by these procedures with respect to these criteria, and complement these results with an extensive experimental study which shows how different variants of swap dynamics behave. In fact, even the simplest dynamics exhibit very good results, and it is possible to further guide the process towards our objectives, if one is ready to sacrifice a bit in terms of decentralization. Along the way, we also show that a simple variant of the Crawler makes it possible to check efficiently that an allocation is Pareto-optimal in single-peaked domains.
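The swap-deal dynamics mentioned above can be sketched as follows: agents repeatedly exchange resources pairwise whenever both strictly prefer what the other holds, until no such pair remains. This is a minimal sketch of one simple variant (pairs scanned in a fixed order), not the specific dynamics studied in the paper; the function name and the three-agent example are assumptions for illustration.

```python
from itertools import combinations


def swap_dynamics(prefs, allocation):
    """Apply mutually beneficial pairwise swaps until none is left.

    prefs[i] is agent i's ranking of resources, best first.
    allocation[i] is the resource initially held by agent i.
    """
    # rank[i][r] = position of resource r in agent i's ranking (0 is best).
    rank = [{r: k for k, r in enumerate(p)} for p in prefs]
    alloc = list(allocation)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(alloc)), 2):
            i_gains = rank[i][alloc[j]] < rank[i][alloc[i]]
            j_gains = rank[j][alloc[i]] < rank[j][alloc[j]]
            if i_gains and j_gains:
                alloc[i], alloc[j] = alloc[j], alloc[i]
                improved = True
    return alloc


# Hypothetical single-peaked profile over resources ordered 0 < 1 < 2.
prefs = [[2, 1, 0], [0, 1, 2], [1, 0, 2]]
initial = [0, 2, 1]
final = swap_dynamics(prefs, initial)
```

Each swap strictly improves both participants, so the total rank decreases at every step and the loop terminates. No agent ends up worse off than with its initial endowment.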
Foolproof Cooperative Learning
Jacq, Alexis, Perolat, Julien, Geist, Matthieu, Pietquin, Olivier
This paper extends the notion of equilibrium in game theory to learning algorithms in repeated stochastic games. We define a learning equilibrium as an algorithm used by a population of players, such that no player can individually use an alternative algorithm and increase its asymptotic score. We introduce Foolproof Cooperative Learning (FCL), an algorithm that converges to a Tit-for-Tat behavior. It allows cooperative strategies when played against itself while not being exploitable by selfish players. We prove that in repeated symmetric games, this algorithm is a learning equilibrium. We illustrate the behavior of FCL on symmetric matrix and grid games, and its robustness to selfish learners.
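To make the Tit-for-Tat behavior concrete, the sketch below plays plain Tit-for-Tat in an iterated prisoner's dilemma. It is not FCL itself, which is a learning algorithm that converges to such behavior. The payoff values (3, 5, 0, 1) are the conventional textbook ones and are an assumption here, as are all function names.

```python
# Assumed payoffs: (row player, column player) for each joint action.
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def tit_for_tat(own_history, opponent_history):
    """Cooperate first, then copy the opponent's previous action."""
    return opponent_history[-1] if opponent_history else "C"


def always_defect(own_history, opponent_history):
    """A selfish player that never cooperates."""
    return "D"


def play(strategy_a, strategy_b, rounds):
    """Return the cumulative scores of both players after the given rounds."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        a = strategy_a(hist_a, hist_b)
        b = strategy_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(a, b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(a)
        hist_b.append(b)
    return score_a, score_b


self_play = play(tit_for_tat, tit_for_tat, 10)
versus_defector = play(tit_for_tat, always_defect, 10)
```

Over 10 rounds with these payoffs, two Tit-for-Tat players earn 30 each, while a defector facing Tit-for-Tat earns only 14. Deviating from cooperation gains one round of advantage and then loses over the rest of the game, which is the intuition behind non-exploitability.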
3D Multi-Robot Patrolling with a Two-Level Coordination Strategy
Freda, Luigi, Gianni, Mario, Pirri, Fiora, Gawel, Abel, Dube, Renaud, Siegwart, Roland, Cadena, Cesar
Teams of UGVs patrolling harsh and complex 3D environments can experience interference and spatial conflicts with one another. Neglecting the occurrence of these events crucially hinders both soundness and reliability of a patrolling process. This work presents a distributed multi-robot patrolling technique, which uses a two-level coordination strategy to minimize and explicitly manage the occurrence of conflicts and interference. The first level guides the agents to single out exclusive target nodes on a topological map. This target selection relies on a shared idleness representation and a coordination mechanism preventing topological conflicts. The second level hosts coordination strategies based on a metric representation of space and is supported by a 3D SLAM system. Here, each robot path planner negotiates spatial conflicts by applying a multi-robot traversability function. Continuous interactions between these two levels ensure coordination and conflict resolution. Both simulations and real-world experiments are presented to validate the performance of the proposed patrolling strategy in 3D environments. Results show this is a promising solution for managing spatial conflicts and preventing deadlocks.
From drone swarms to AI border guards: How futuristic technology could be used to police Britain's borders
Whether it is the Irish backstop or English Channel, the issue of how the UK and Europe are controlling their borders has been thrust into the public consciousness. And as with many of the globe's conundrums, countries and private companies are turning to ever more futuristic, and often controversial, technologies in order to protect their borders. There are, of course, immediate issues for Britain's borders with quandaries such as the potential hard border in Northern Ireland following Brexit, with the nebulous 'technology' promised by some politicians either still being developed or put under question. One such future proposal is a satellite system that registers mobile phones as they pass the border, while sensors buried in the ground or radars on flying drones could detect possible unlawful breaches of the boundaries. But that would still leave the question of invasive, even if largely invisible, checks that run against the Good Friday Agreement.
Leveraging Reinforcement Learning Techniques for Effective Policy Adoption and Validation
Kuang, Nikki Lijing, Leung, Clement H. C.
Rewards and punishments in different forms are pervasive and present in a wide variety of decision-making scenarios. By observing the outcome of a sufficient number of repeated trials, one would gradually learn the value and usefulness of a particular policy or strategy. However, in a given environment, the outcomes resulting from different trials are subject to chance influence and variations. In learning about the usefulness of a given policy, significant costs are involved in systematically undertaking the sequential trials; therefore, in most learning episodes, one would wish to keep the cost within bounds by adopting learning stopping rules. In this paper, we examine the deployment of different stopping strategies in given learning environments which vary from highly stringent for mission-critical operations to highly tolerant for non-mission-critical operations, and emphasis is placed on the former with particular application to aviation safety. In policy evaluation, two sequential phases of learning are identified, and we describe the outcome variations using a probabilistic model, with closed-form expressions obtained for the key measures of performance. Decision rules that map the trial observations to policy choices are also formulated. In addition, simulation experiments are performed, which corroborate the validity of the theoretical results.
Categorizing Wireheading in Partially Embedded Agents
Majha, Arushi, Sarkar, Sayan, Zagami, Davide
$\textit{Embedded agents}$ are not explicitly separated from their environment, lacking clear I/O channels. Such agents can reason about and modify their internal parts, which they are incentivized to shortcut or $\textit{wirehead}$ in order to achieve the maximal reward. In this paper, we provide a taxonomy of ways by which wireheading can occur, followed by a definition of wirehead-vulnerable agents. Starting from the fully dualistic universal agent AIXI, we introduce a spectrum of partially embedded agents and identify wireheading opportunities that such agents can exploit, experimentally demonstrating the results with the GRL simulation platform AIXIjs. We contextualize wireheading in the broader class of all misalignment problems - where the goals of the agent conflict with the goals of the human designer - and conjecture that the only other possible type of misalignment is specification gaming. Motivated by this taxonomy, we define wirehead-vulnerable agents as embedded agents that choose to behave differently from fully dualistic agents lacking access to their internal parts.
When Multiple Agents Learn to Schedule: A Distributed Radio Resource Management Framework
Naderializadeh, Navid, Sydir, Jaroslaw, Simsek, Meryem, Nikopour, Hosein, Talwar, Shilpa
Interference among concurrent transmissions in a wireless network is a key factor limiting the system performance. One way to alleviate this problem is to manage the radio resources in order to maximize either the average or the worst-case performance. However, joint consideration of both metrics is often neglected as they are competing in nature. In this article, a mechanism for radio resource management using multi-agent deep reinforcement learning (RL) is proposed, which strikes the right trade-off between maximizing the average and the $5^{th}$ percentile user throughput. Each transmitter in the network is equipped with a deep RL agent, receiving partial observations from the network (e.g., channel quality and interference level) and deciding whether to be active or inactive at each scheduling interval for given radio resources, a process referred to as link scheduling. Based on the actions of all agents, the network emits a reward to the agents, indicating how good their joint decisions were. The proposed framework enables the agents to make decisions in a distributed manner, and the reward is designed in such a way that the agents strive to guarantee a minimum performance, leading to a fair resource allocation among all users across the network. Simulation results demonstrate the superiority of our approach compared to decentralized baselines in terms of average and $5^{th}$ percentile user throughput, while achieving performance close to that of a centralized exhaustive search approach. Moreover, the proposed framework is robust to mismatches between training and testing scenarios. In particular, it is shown that an agent trained on a network with low transmitter density maintains its performance and outperforms the baselines when deployed in a network with a higher transmitter density.
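The link scheduling decision and the centralized exhaustive search baseline can be illustrated on a toy network. The sketch below enumerates every on/off pattern and keeps the one with the highest sum rate. The channel gains, noise level, unit transmit power, Shannon-rate formula, and the sum-rate objective are all assumptions chosen for illustration; the paper's reward, which also accounts for the 5th percentile throughput, is not reproduced here.

```python
import math
from itertools import product


def link_rates(active, gains, noise=0.1):
    """Per-link rate log2(1 + SINR) for an on/off pattern, with unit transmit power.

    gains[i][j] is the assumed channel gain from transmitter j to receiver i.
    """
    rates = []
    for i, on in enumerate(active):
        if not on:
            rates.append(0.0)
            continue
        interference = sum(
            gains[i][j] for j, other_on in enumerate(active) if other_on and j != i
        )
        rates.append(math.log2(1 + gains[i][i] / (noise + interference)))
    return rates


def exhaustive_schedule(gains, noise=0.1):
    """Try every on/off pattern and return the one with the highest sum rate."""
    n = len(gains)
    return max(
        product((0, 1), repeat=n),
        key=lambda pattern: sum(link_rates(pattern, gains, noise)),
    )


# Two hypothetical two-link networks: strong and weak cross-interference.
strong = exhaustive_schedule([[1.0, 0.9], [0.9, 1.0]])
weak = exhaustive_schedule([[1.0, 0.01], [0.01, 1.0]])
```

With strong cross-interference the search switches one link off, and with weak interference it activates both. The search cost grows as 2^n in the number of links, which is one reason a distributed learned policy is attractive.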