Agents
Hierarchically Structured Scheduling and Execution of Tasks in a Multi-Agent Environment
Carvalho, Diogo S., Sengupta, Biswa
In a warehouse environment, tasks appear dynamically. Consequently, a task management system that matches them with the workforce too early (e.g., weeks in advance) is necessarily sub-optimal. Also, the rapidly increasing size of the action space of such a system constitutes a significant problem for traditional schedulers. Reinforcement learning, however, is suited to problems that require making sequential decisions towards a long-term, often remote, goal. In this work, we address a problem with a hierarchical structure: task scheduling by a centralised agent in a dynamic warehouse multi-agent environment, and the execution of one such schedule by decentralised agents with only partial observability thereof. We propose to use deep reinforcement learning to solve both the high-level scheduling problem and the low-level multi-agent problem of schedule execution. Finally, we also consider the case where centralisation is impossible at test time and workers must learn how to cooperate in executing the tasks in an environment with no schedule and only partial observability.
How to Train your Decision-Making AIs
The combination of deep learning and decision learning has led to several impressive stories in decision-making AI research, including AIs that can play a variety of games (Atari video games, board games, the complex real-time strategy game StarCraft II), control robots (in simulation and in the real world), and even fly a weather balloon. These are examples of sequential decision tasks, in which the AI agent needs to make a sequence of decisions to achieve its goal. Today, the two main approaches for training such agents are reinforcement learning (RL) and imitation learning (IL). In reinforcement learning, humans provide rewards for completing discrete tasks, with the rewards typically being delayed and sparse. For example, 100 points are given for solving the first room of Montezuma's Revenge (Fig. 1). In the imitation learning setting, humans can transfer knowledge and skills through step-by-step action demonstrations (Fig. 2), and the agent then learns to mimic human actions.
Fully Decentralized, Scalable Gaussian Processes for Multi-Agent Federated Learning
Kontoudis, George P., Stilwell, Daniel J.
In this paper, we propose decentralized and scalable algorithms for Gaussian process (GP) training and prediction in multi-agent systems. To decentralize the implementation of GP training optimization algorithms, we employ the alternating direction method of multipliers (ADMM). A closed-form solution of the decentralized proximal ADMM is provided for the case of GP hyper-parameter training with maximum likelihood estimation. Multiple aggregation techniques for GP prediction are decentralized with the use of iterative and consensus methods. In addition, we propose a covariance-based nearest neighbor selection strategy that enables a subset of agents to perform predictions. The efficacy of the proposed methods is illustrated with numerical experiments on synthetic and real data.
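To make the idea of decentralizing a GP prediction aggregation with a consensus method concrete, here is a minimal sketch. It is not the authors' algorithm: the abstract does not name the aggregation techniques it covers, so the product-of-experts rule is used here purely as an illustrative assumption, and the function names (`poe_aggregate`, `average_consensus`, `decentralized_poe`), the step size, and the iteration count are all hypothetical. Each agent is assumed to already hold a local predictive mean and variance for the same test point.

```python
def poe_aggregate(means, variances):
    """Centralized product-of-experts: precision-weighted combination."""
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    mean = sum(p * m for p, m in zip(precisions, means)) / total_precision
    return mean, 1.0 / total_precision


def average_consensus(values, neighbors, steps=200, step_size=0.3):
    """Each agent repeatedly moves toward its neighbors' values.

    On a connected undirected graph, with a step size below
    1 / (maximum node degree), all values converge to the network average.
    """
    x = list(values)
    for _ in range(steps):
        x = [
            xi + step_size * sum(x[j] - xi for j in neighbors[i])
            for i, xi in enumerate(x)
        ]
    return x


def decentralized_poe(means, variances, neighbors):
    """Every agent recovers the product-of-experts result using only
    neighbor-to-neighbor averaging (assumes each agent knows n)."""
    n = len(means)
    precisions = [1.0 / v for v in variances]
    avg_precision = average_consensus(precisions, neighbors)
    avg_weighted_mean = average_consensus(
        [p * m for p, m in zip(precisions, means)], neighbors
    )
    # The ratio of the two averages is the aggregated mean; n times the
    # average precision is the aggregated precision.
    return [
        (w / p, 1.0 / (n * p))
        for w, p in zip(avg_weighted_mean, avg_precision)
    ]


# Four agents on a line graph: 0 - 1 - 2 - 3 (illustrative values only).
local_means = [1.0, 2.0, 3.0, 4.0]
local_variances = [1.0, 0.5, 2.0, 1.0]
line_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

central = poe_aggregate(local_means, local_variances)
per_agent = decentralized_poe(local_means, local_variances, line_graph)
```

After enough consensus steps, every entry of `per_agent` should match `central` to within numerical tolerance, even though no agent ever sees more than its neighbors' values.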
DEC-LOS-RRT: Decentralized Path Planning for Multi-robot Systems with Line-of-sight Constrained Communication
Tuck, Victoria, Pant, Yash Vardhan, Seshia, Sanjit A., Sastry, S. Shankar
Decentralized planning for multi-agent systems, such as fleets of robots in a search-and-rescue operation, is often constrained by limitations on how agents can communicate with each other. One such limitation is the case when agents can communicate with each other only when they are in line-of-sight (LOS). Developing decentralized planning methods that guarantee safety is difficult in this case, as agents that are occluded from each other might not be able to communicate until it's too late to avoid a safety violation. In this paper, we develop a decentralized planning method that explicitly avoids situations where lack of visibility of other agents would lead to an unsafe situation. Building on top of an existing Rapidly-exploring Random Tree (RRT)-based approach, our method guarantees safety at each iteration. Simulation studies show the effectiveness of our method and compare the degradation in performance with respect to a clairvoyant decentralized planning algorithm where agents can communicate despite not being in LOS of each other.
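The line-of-sight constraint can be illustrated with a small geometric check. The sketch below is an assumption-laden toy, not part of DEC-LOS-RRT itself: it models obstacles as circles in the plane (the abstract does not specify an obstacle model), and the name `has_line_of_sight` is hypothetical. Two agents are treated as able to communicate only if the straight segment between them misses every obstacle.

```python
import math


def has_line_of_sight(p, q, obstacles):
    """Return True if the segment p-q misses every circular obstacle.

    p, q: (x, y) agent positions.
    obstacles: iterable of (cx, cy, radius) circles (an assumed model).
    """
    px, py = p
    qx, qy = q
    dx, dy = qx - px, qy - py
    seg_len_sq = dx * dx + dy * dy
    for cx, cy, radius in obstacles:
        if seg_len_sq == 0.0:
            t = 0.0  # Both agents are at the same point.
        else:
            # Project the circle centre onto the segment, clamped to [0, 1].
            t = ((cx - px) * dx + (cy - py) * dy) / seg_len_sq
            t = max(0.0, min(1.0, t))
        nearest_x, nearest_y = px + t * dx, py + t * dy
        if math.hypot(cx - nearest_x, cy - nearest_y) <= radius:
            return False  # The obstacle occludes the two agents.
    return True


blocked = has_line_of_sight((0.0, 0.0), (10.0, 0.0), [(5.0, 0.0, 1.0)])
clear = has_line_of_sight((0.0, 0.0), (10.0, 0.0), [(5.0, 3.0, 1.0)])
```

A planner in this setting would have to treat any pair of agents for which such a check fails as unable to exchange plans at that moment.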
AutoDIME: Automatic Design of Interesting Multi-Agent Environments
Kanitscheider, Ingmar, Edwards, Harri
Designing a distribution of environments in which RL agents can learn interesting and useful skills is a challenging and poorly understood task; for multi-agent environments, the difficulties are only exacerbated. One approach is to train a second RL agent, called a teacher, who samples environments that are conducive to the learning of student agents. However, most previous proposals for teacher rewards do not generalize straightforwardly to the multi-agent setting. We examine a set of intrinsic teacher rewards derived from prediction problems that can be applied in multi-agent settings and evaluate them in MuJoCo tasks such as multi-agent Hide and Seek [1] as well as a diagnostic single-agent maze task. Of the intrinsic rewards considered, we found value disagreement to be most consistent across tasks, leading to faster and more reliable emergence of advanced skills in Hide and Seek and the maze task. Another candidate intrinsic reward considered, value prediction error, also worked well in Hide and Seek but was susceptible to noisy-TV style distractions in stochastic environments. Policy disagreement performed well in the maze task but did not speed up learning in Hide and Seek. Our results suggest that intrinsic teacher rewards, and in particular value disagreement, are a promising approach for automating both single and multi-agent environment design.
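As a rough illustration of two of the teacher rewards named above, the sketch below computes a disagreement score and a prediction-error score from plain numbers. The abstract does not give the exact definitions, so both formulas are assumptions: value disagreement is taken here to be the standard deviation across an ensemble of student value estimates for a sampled environment, and value prediction error the absolute gap between a predicted and a realized return. The function names are hypothetical.

```python
import statistics


def value_disagreement(value_estimates):
    """Assumed definition: spread of an ensemble of value estimates.

    value_estimates: predictions from several independently initialised
    value functions for the same sampled environment.
    """
    return statistics.pstdev(value_estimates)


def value_prediction_error(predicted_value, realized_return):
    """Assumed definition: absolute error of a single value prediction."""
    return abs(predicted_value - realized_return)


# An environment the students agree on yields no teacher reward ...
agreed = value_disagreement([1.0, 1.0, 1.0])
# ... while one they disagree on yields a positive reward.
contested = value_disagreement([0.0, 2.0])
```

Under these assumed definitions, pure outcome noise inflates the prediction error even when nothing more can be learned, whereas an ensemble can still agree on the expected value, which is one plausible reading of the noisy-TV sensitivity reported above.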
Predicting Like A Pilot: Dataset and Method to Predict Socially-Aware Aircraft Trajectories in Non-Towered Terminal Airspace
Patrikar, Jay, Moon, Brady, Oh, Jean, Scherer, Sebastian
Pilots operating aircraft in non-towered airspace rely on their situational awareness and prior knowledge to predict the future trajectories of other agents. These predictions are conditioned on the past trajectories of other agents, agent-agent social interactions and environmental context such as airport location and weather. This paper provides a dataset, $\textit{TrajAir}$, that captures this behaviour in a non-towered terminal airspace around a regional airport. We also present a baseline socially-aware trajectory prediction algorithm, $\textit{TrajAirNet}$, that uses the dataset to predict the trajectories of all agents. The dataset is collected for 111 days over 8 months and contains ADS-B transponder data along with the corresponding METAR weather data. The data is processed to be used as a benchmark with other publicly available social navigation datasets. To the best of the authors' knowledge, this is the first 3D social aerial navigation dataset, thus introducing social navigation for autonomous aviation. $\textit{TrajAirNet}$ combines state-of-the-art modules in social navigation to provide predictions in a static environment with a dynamic context. Both the $\textit{TrajAir}$ dataset and $\textit{TrajAirNet}$ prediction algorithm are open-source. The dataset, codebase, and video are available at https://theairlab.org/trajair/, https://github.com/castacks/trajairnet, and https://youtu.be/elAQXrxB2gw respectively.
Competitors-Aware Stochastic Lap Strategy Optimisation for Race Hybrid Vehicles
Braghin, Francesco, Paparusso, Luca, Riani, Manuel, Ruggeri, Fabio
World Endurance Championship (WEC) racing events are characterised by a significant performance gap among competitors. The fastest vehicle category, consisting of hybrid vehicles, has to respect energy usage constraints set by the technical regulations. In the absence of competitors, i.e. without traffic, the optimal energy usage strategy for lap time minimisation is typically computed by solving a constrained optimisation problem. To the best of our knowledge, the majority of state-of-the-art works neglect competitors. This leads to a mismatch with the real world, where traffic generates considerable time losses. To bridge this gap, we propose a new framework to compute offline optimal strategies for the powertrain energy management considering competitors. Through analysis of the available data from previous events, statistics on the sector times and overtaking probabilities are extracted to encode the competitors' behaviour. Adopting a multi-agent model, the statistics are then used to generate realistic Monte Carlo (MC) simulations of their position along the track. The simulator is then adopted to identify the optimal strategy as follows. We develop a longitudinal vehicle model for the ego-vehicle and implement an optimisation problem for lap time minimisation in the absence of traffic, based on Genetic Algorithms. Solving the optimisation problem for a variety of constraints generates a set of candidate optimal strategies. Stochastic Dynamic Programming is finally implemented to choose the best strategy considering competitors, whose motion is generated by the MC simulator. Our approach, validated on data from a real race stint, significantly reduces the lap time.
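The Monte Carlo step described above can be sketched in a few lines. This is only a toy under stated assumptions, not the authors' simulator: sector times are assumed to be independent and normally distributed, the per-sector means and standard deviations are invented for illustration, overtaking probabilities are left out entirely, and `simulate_lap_times` is a hypothetical name.

```python
import random
import statistics


def simulate_lap_times(sector_stats, n_samples=10000, seed=0):
    """Sample competitor lap times from per-sector statistics.

    sector_stats: list of (mean_seconds, std_seconds) per track sector,
    assumed independent and Gaussian (an illustrative simplification).
    """
    rng = random.Random(seed)  # Fixed seed for reproducibility.
    return [
        sum(max(0.0, rng.gauss(mu, sigma)) for mu, sigma in sector_stats)
        for _ in range(n_samples)
    ]


# Invented statistics for a three-sector track (seconds).
example_sectors = [(30.0, 0.5), (35.0, 0.7), (25.0, 0.4)]
lap_times = simulate_lap_times(example_sectors)
mean_lap_time = statistics.fmean(lap_times)
```

In a fuller model, samples like these would feed a position simulation along the track, and the overtaking probabilities extracted from past events would decide whether the ego-vehicle loses time behind a slower car.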
Provably Efficient Convergence of Primal-Dual Actor-Critic with Nonlinear Function Approximation
Dong, Jing, Shen, Li, Xu, Yinggan, Wang, Baoxiang
We study the convergence of the actor-critic algorithm with nonlinear function approximation under a nonconvex-nonconcave primal-dual formulation. Stochastic gradient descent ascent is applied with an adaptive proximal term for robust learning rates. We show the first efficient convergence result with primal-dual actor-critic with a convergence rate of $\mathcal{O}\left(\sqrt{\frac{\ln \left(N d G^2 \right)}{N}}\right)$ under Markovian sampling, where $G$ is the element-wise maximum of the gradient, $N$ is the number of iterations, and $d$ is the dimension of the gradient. Our result is presented with only the Polyak-\L{}ojasiewicz condition for the dual variables, which is easy to verify and applicable to a wide range of reinforcement learning (RL) scenarios. The algorithm and analysis are general enough to be applied to other RL settings, like multi-agent RL. Empirical results on OpenAI Gym continuous control tasks corroborate our theoretical findings.
Why teaching robots to play hide-and-seek could be the key to next-gen A.I.
Artificial general intelligence, the idea of an intelligent A.I. agent that's able to understand and learn any intellectual task that humans can do, has long been a component of science fiction. As A.I. gets smarter and smarter -- especially with breakthroughs in machine learning tools that are able to rewrite their code to learn from new experiences -- it's increasingly a part of real artificial intelligence conversations as well. But how do we measure AGI when it does arrive? Over the years, researchers have laid out a number of possibilities. The most famous remains the Turing Test, in which a human judge interacts, sight unseen, with both humans and a machine, and must try to guess which is which.
Stanford AI Lab Papers and Talks at AAAI 2022
The 36th AAAI Conference on Artificial Intelligence (AAAI 2022) is being hosted virtually from February 22nd to March 1st. We're excited to share all the work from SAIL that's being presented, and you'll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that's happening at Stanford. We look forward to seeing you at AAAI 2022.