Agents
Value Variance Minimization for Learning Approximate Equilibrium in Aggregation Systems
Verma, Tanvi, Varakantham, Pradeep
For effective matching of resources (e.g., taxis, food, bikes, shopping items) to customer demand, aggregation systems have been extremely successful. In aggregation systems, a central entity (e.g., Uber, Food Panda, Ofo) aggregates supply (e.g., drivers, delivery personnel) and matches demand to supply on a continuous basis (sequential decisions). Due to the objective of the central entity to maximize its profits, individual suppliers get sacrificed thereby creating incentive for individuals to leave the system. In this paper, we consider the problem of learning approximate equilibrium solutions (win-win solutions) in aggregation systems, so that individuals have an incentive to remain in the aggregation system. Unfortunately, such systems have thousands of agents and have to consider demand uncertainty and the underlying problem is a (Partially Observable) Stochastic Game. Given the significant complexity of learning or planning in a stochastic game, we make three key contributions: (a) To exploit infinitesimally small contribution of each agent and anonymity (reward and transitions between agents are dependent on agent counts) in interactions, we represent this as a Multi-Agent Reinforcement Learning (MARL) problem that builds on insights from non-atomic congestion games model; (b) We provide a novel variance reduction mechanism for moving joint solution towards Nash Equilibrium that exploits the infinitesimally small contribution of each agent; and finally (c) We provide detailed results on three different domains to demonstrate the utility of our approach in comparison to state-of-the-art methods.
A Job-Assignment Heuristic for Lifelong Multi-Agent Path Finding Problem with Multiple Delivery Locations
Multi-agent path finding (MAPF) algorithms are offline methods intended to find conflict-free paths for more than one agent. However, for many real-life applications, this problem description is inadequate for representing the needs of the domain. To address this issue we worked on a lifelong variation in which agents can have more than one ordered destination. New destinations can be inserted into the system anytime after the initial job-assignment has been made, and these new destinations must also be assigned to agents, and the time of visiting the new destination must also be determined. We called this Lifelong Multi-Agent Path Finding with Multiple Delivery Locations (MAPF-MD). To solve this problem we introduced the Multiple Delivery Conflict-Based Search algorithm (MD-DCBS). We used D*-lite in the low-level search of CBS to benefit from the D*-lite's incremental nature in achieving a performance increase in the CBS search. To handle multiple delivery locations we define multiple D*-lite instances for each agent. The aggregations of all of the paths produced by the D*-lite instances constitute the path of that agent. After that we run CBS on aggregated paths. In this problem we introduced the Multiple Delivery Conflict-Based Search algorithm (MD-DCBS). We used D*-lite in the low-level search of CBS to benefit from the D*-lite's incremental nature in achieving a performance increase in the CBS search. To handle multiple delivery locations we define multiple D*-lite instances for each agent. The aggregations of all of the paths produced by the D*-lite instances constitute the path of that agent. After that we run CBS on aggregated paths. We have shown that this version solves MAPF-MD instances correctly. We also proposed multiple job-assignment heuristics to generate low-total-cost solutions and determined the best performing method amongst them.
Artificial Intelligence Crime: An Interdisciplinary Analysis of Foreseeable Threats and Solutions
Artificial intelligence (AI) may play an increasingly essentialFootnote 1 role in criminal acts in the future. Criminal acts are defined here as any act (or omission) constituting an offence punishable under English criminal law,Footnote 2 without loss of generality to jurisdictions that similarly define crime. Evidence of "AI-Crime" (AIC) is provided by two (theoretical) research experiments. In the first one, two computational social scientists (Seymour and Tully 2016) used AI as an instrument to convince social media users to click on phishing links within mass-produced messages. Because each message was constructed using machine learning techniques applied to users' past behaviours and public profiles, the content was tailored to each individual, thus camouflaging the intention behind each message. If the potential victim had clicked on the phishing link and filled in the subsequent web-form, then (in real-world circumstances) a criminal would have obtained personal and private information that could be used for theft and fraud. AI-fuelled crime may also impact commerce. In the second experiment, three computer scientists (Martínez-Miranda et al. 2016) simulated a market and found that trading agents could learn and execute a "profitable" market manipulation campaign comprising a set of deceitful false-orders. These two experiments show that AI provides a feasible and fundamentally novel threat, in the form of AIC. The importance of AIC as a distinct phenomenon has not yet been acknowledged. The literature on AI's ethical and social implications focuses on regulating and controlling AI's civil uses, rather than considering its possible role in crime (Kerr 2004).
Cooperation without Coordination: Hierarchical Predictive Planning for Decentralized Multiagent Navigation
Wang, Rose E., Kew, J. Chase, Lee, Dennis, Lee, Tsang-Wei Edward, Zhang, Tingnan, Ichter, Brian, Tan, Jie, Faust, Aleksandra
Decentralized multiagent planning raises many challenges, such as adaption to changing environments inexplicable by the agent's own behavior, coordination from noisy sensor inputs like lidar, cooperation without knowing other agents' intents. To address these challenges, we present hierarchical predictive planning (HPP) for decentralized multiagent navigation tasks. HPP learns prediction models for itself and other teammates, and uses the prediction models to propose and evaluate navigation goals that complete the cooperative task without explicit coordination. To learn the prediction models, HPP observes other agents' behavior and learns to maps own sensors to predicted locations of other agents. HPP then uses the cross-entropy method to iteratively propose, evaluate, and improve navigation goals, under assumption that all agents in the team share a common objective. HPP removes the need for a centralized operator (i.e. robots determine their own actions without coordinating their beliefs or plans) and can be trained and easily transferred to real world environments. The results show that HPP generalizes to new environments including real-world robot team. It is also 33x more sample efficient and performs better in complex environments compared to a baseline. The video and website for this paper can be found at https://youtu.be/-LqgfksqNH8 and https://sites.google.com/view/multiagent-hpp.
Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments
Lowe, Ryan, Wu, Yi, Tamar, Aviv, Harb, Jean, Abbeel, Pieter, Mordatch, Igor
We explore deep reinforcement learning methods for multi-agent domains. We begin by analyzing the difficulty of traditional algorithms in the multi-agent case: Q-learning is challenged by an inherent non-stationarity of the environment, while policy gradient suffers from a variance that increases as the number of agents grows. We then present an adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multi-agent coordination. Additionally, we introduce a training regimen utilizing an ensemble of policies for each agent that leads to more robust multi-agent policies. We show the strength of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover various physical and informational coordination strategies.
Explaining the Punishment Gap of AI and Robots
Lima, Gabriel, Cha, Meeyoung, Jeon, Chihyung, Park, Kyungsin
The European Parliament's proposal to create a new legal status for artificial intelligence (AI) and robots brought into focus the idea of electronic legal personhood. This discussion, however, is hugely controversial. While some scholars argue that the proposed status could contribute to the coherence of the legal system, others say that it is neither beneficial nor desirable. Notwithstanding this prospect, we conducted a survey (N=3315) to understand online users' perceptions of the legal personhood of AI and robots. We observed how the participants assigned responsibility, awareness, and punishment to AI, robots, humans, and various entities that could be held liable under existing doctrines. We also asked whether the participants thought that punishing electronic agents fulfills the same legal and social functions as human punishment. The results suggest that even though people do not assign any mental state to electronic agents and are not willing to grant AI and robots physical independence or assets, which are the prerequisites of criminal or civil liability, they do consider them responsible for their actions and worthy of punishment. The participants also did not think that punishment or liability of these entities would achieve the primary functions of punishment, leading to what we define as the punishment gap. Therefore, before we recognize electronic legal personhood, we must first discuss proper methods of satisfying the general population's demand for punishment.
Accelerating and Improving AlphaZero Using Population Based Training
Wu, Ti-Rong, Wei, Ting-Han, Wu, I-Chen
AlphaZero has been very successful in many games. Unfortunately, it still consumes a huge amount of computing resources, the majority of which is spent in self-play. Hyperparameter tuning exacerbates the training cost since each hyperparameter configuration requires its own time to train one run, during which it will generate its own self-play records. As a result, multiple runs are usually needed for different hyperparameter configurations. This paper proposes using population based training (PBT) to help tune hyperparameters dynamically and improve strength during training time. Another significant advantage is that this method requires a single run only, while incurring a small additional time cost, since the time for generating self-play records remains unchanged though the time for optimization is increased following the AlphaZero training algorithm. In our experiments for 9x9 Go, the PBT method is able to achieve a higher win rate for 9x9 Go than the baselines, each with its own hyperparameter configuration and trained individually. For 19x19 Go, with PBT, we are able to obtain improvements in playing strength. Specifically, the PBT agent can obtain up to 74% win rate against ELF OpenGo, an open-source state-of-the-art AlphaZero program using a neural network of a comparable capacity. This is compared to a saturated non-PBT agent, which achieves a win rate of 47% against ELF OpenGo under the same circumstances.
Robust Multi-Agent Path Finding and Executing
Atzmon, Dor (Ben-Gurion University) | Stern, Roni (Ben-Gurion University) | Felner, Ariel (Ben-Gurion University) | Wagner, Glenn (Carnegie Mellon University) | Barták, Roman (Charles University) | Zhou, Neng-Fa (CUNY Brooklyn College)
Multi-agent path-finding (MAPF) is the problem of finding a plan for moving a set of agents from their initial locations to their goals without collisions. Following this plan, however, may not be possible due to unexpected events that delay some of the agents. In this work, we propose a holistic solution for MAPF that is robust to such unexpected delays. First, we introduce the notion of a k-robust MAPF plan, which is a plan that can be executed even if a limited number (k) of delays occur. We propose sufficient and required conditions for finding a k-robust plan, and show how to convert several MAPF solvers to find such plans. Then, we propose several robust execution policies. An execution policy is a policy for agents executing a MAPF plan. An execution policy is robust if following it guarantees that the agents reach their goals even if they encounter unexpected delays. Several classes of such robust execution policies are proposed and evaluated experimentally. Finally, we present robust execution policies for cases where communication between the agents may also be delayed. We performed an extensive experimental evaluation in which we compared different algorithms for finding robust MAPF plans, compared different ro- bust execution policies, and studied the interplay between having a robust plan and the performance when using a robust execution policy.
A General Framework for Learning Mean-Field Games
Guo, Xin, Hu, Anran, Xu, Renyuan, Zhang, Junzi
This paper is motivated by the following Ad auction problem for an advertiser. An Ad auction is a stochastic game on an Ad exchange platform among a large number of players, the advertisers. In between the time a web user requests a page and the time the page is displayed, usually within a millisecond, a Vickrey-type of second-best-price auction is run to incentivize interested advertisers to bid for an Ad slot to display advertisement. Each advertiser has limited information before each bid: first, her own valuation for a slot depends on an unknown conversion of clicks for the item; secondly, she, should she win the bid, only knows the reward after the user's activities on the website are finished. In addition, she has a budget constraint in this repeated auction. The question is, how should she bid in this online sequential repeated game when there is a large population of bidders competing on the Ad platform, with unknown distributions of the conversion of clicks and rewards? Besides the Ad auction, there are many real-world problems involving a large number of players and unknown systems. Examples include massive multi-player online roleplaying games [30], high frequency tradings [35], and the sharing economy [24].
Research Directions for Developing and Operating Artificial Intelligence Models in Trustworthy Autonomous Systems
Martínez-Fernández, Silverio, Franch, Xavier, Jedlitschka, Andreas, Oriol, Marc, Trendowicz, Adam
Context: Autonomous Systems (ASs) are becoming increasingly pervasive in today's society. One reason lies in the emergence of sophisticated Artificial Intelligence (AI) solutions that boost the ability of ASs to self-adapt in increasingly complex and dynamic environments. Companies dealing with AI models in ASs face several problems, such as users' lack of trust in adverse or unknown conditions, and gaps between systems engineering and AI model development and evolution in a continuously changing operational environment. Objective: This vision paper aims to close the gap between the development and operation of trustworthy AI-based ASs by defining a process that coordinates both activities. Method: We synthesize the main challenges of AI-based ASs in industrial settings. To overcome such challenges, we propose a novel, holistic DevOps approach and reflect on the research efforts required to put it into practice. Results: The approach sets up five critical research directions: (a) a trustworthiness score to monitor operational AI-based ASs and identify self-adaptation needs in critical situations; (b) an integrated agile process for the development and continuous evolution of AI models; (c) an infrastructure for gathering key feedback required to address the trustworthiness of AI models at operation time; (d) continuous and seamless deployment of different context-specific instances of AI models in a distributed setting of ASs; and (e) a holistic and effective DevOps-based lifecycle for AI-based ASs. Conclusions: An approach supporting the continuous delivery of evolving AI models and their operation in ASs under adverse conditions would support companies in increasing users' trust in their products.