Reinforcement Learning
ELLA: Exploration through Learned Language Abstraction
Mirchandani, Suvir, Karamcheti, Siddharth, Sadigh, Dorsa
Building agents capable of understanding language instructions is critical to effective and robust human-AI collaboration. Recent work focuses on training these instruction-following agents via reinforcement learning in environments with synthetic language; however, these instructions often define long-horizon, sparse-reward tasks, and learning policies requires many episodes of experience. To address this, we introduce ELLA: Exploration through Learned Language Abstraction, a reward shaping approach that correlates high-level instructions with simpler low-level instructions to enrich the sparse rewards afforded by the environment. ELLA has two key elements: 1) a termination classifier that identifies when agents complete low-level instructions, and 2) a relevance classifier that correlates low-level instructions with success on high-level tasks. We learn the termination classifier offline from pairs of instructions and terminal states. Notably, in departure from prior work in language and abstraction, we learn the relevance classifier online, without relying on an explicit decomposition of high-level instructions to low-level instructions. On a suite of complex grid world environments with varying instruction complexities and reward sparsity, ELLA shows a significant gain in sample efficiency across several environments compared to competitive language-based reward shaping and no-shaping methods.
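The interplay of the two classifiers can be illustrated with a minimal sketch. It is not the authors' implementation: the function `shaped_reward`, the `bonus` value, the once-per-episode bookkeeping in `already_rewarded`, and the toy stand-ins for both classifiers are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Set


def shaped_reward(
    env_reward: float,
    state: object,
    high_level: str,
    low_level_instructions: Iterable[str],
    terminated: Callable[[str, object], bool],  # stand-in for the termination classifier
    relevant: Callable[[str, str], bool],       # stand-in for the relevance classifier
    already_rewarded: Set[str],
    bonus: float = 0.1,                         # hypothetical shaping bonus
) -> float:
    """Add a bonus for each completed low-level instruction judged relevant.

    Assumption: each low-level instruction is rewarded at most once per
    episode, tracked through the mutable set `already_rewarded`.
    """
    reward = env_reward
    for low in low_level_instructions:
        if low in already_rewarded:
            continue
        if terminated(low, state) and relevant(low, high_level):
            reward += bonus
            already_rewarded.add(low)
    return reward


# Toy stand-ins: the "state" is simply the set of low-level instructions
# completed so far, and relevance is a hand-written lookup table.
high = "put the red ball next to the key"
lows = ["go to the red ball", "go to the key", "go to the door"]
relevance_table = {"go to the red ball": True, "go to the key": True, "go to the door": False}


def toy_terminated(low: str, state: object) -> bool:
    return low in state


def toy_relevant(low: str, high_level: str) -> bool:
    return relevance_table[low]


rewarded: Set[str] = set()
state = {"go to the red ball", "go to the door"}
first = shaped_reward(0.0, state, high, lows, toy_terminated, toy_relevant, rewarded)
second = shaped_reward(0.0, state, high, lows, toy_terminated, toy_relevant, rewarded)
```

In this toy run only "go to the red ball" earns a bonus: "go to the door" is completed but marked irrelevant, and the second call adds nothing because the bonus has already been paid.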
The AI Arena: A Framework for Distributed Multi-Agent Reinforcement Learning
Staley, Edward W., Rivera, Corban G., Llorens, Ashley J.
Advances in reinforcement learning (RL) have resulted in recent breakthroughs in the application of artificial intelligence (AI) across many different domains. An emerging landscape of development environments is making powerful RL techniques more accessible for a growing community of researchers. However, most existing frameworks do not directly address the problem of learning in complex operating environments, such as dense urban settings or defense-related scenarios, that incorporate distributed, heterogeneous teams of agents. To help enable AI research for this important class of applications, we introduce the AI Arena: a scalable framework with flexible abstractions for distributed multi-agent reinforcement learning. The AI Arena extends the OpenAI Gym interface to allow greater flexibility in learning control policies across multiple agents with heterogeneous learning strategies and localized views of the environment. To illustrate the utility of our framework, we present experimental results that demonstrate performance gains due to a distributed multi-agent learning approach over commonly used RL techniques in several different learning environments.
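To make the idea of a Gym-style interface generalized to several agents concrete, here is a minimal sketch. It does not reproduce the AI Arena API: the class `ToyMultiAgentEnv`, its list-based `reset`/`step` signatures, and the two toy policies are hypothetical and only mimic the familiar single-agent `reset`/`step` pattern with one entry per agent.

```python
import random
from typing import Callable, Dict, List, Tuple


class ToyMultiAgentEnv:
    """Hypothetical Gym-style environment with per-agent observations.

    Each agent lives on a 1-D line and tries to reach position 0.
    An agent observes only its own position (a localized view).
    """

    def __init__(self, n_agents: int = 2, size: int = 5, seed: int = 0) -> None:
        self.n_agents = n_agents
        self.size = size
        self.rng = random.Random(seed)
        self.positions: List[int] = []

    def reset(self) -> List[int]:
        self.positions = [self.rng.randint(1, self.size) for _ in range(self.n_agents)]
        return list(self.positions)

    def step(self, actions: List[int]) -> Tuple[List[int], List[float], bool, Dict]:
        if len(actions) != self.n_agents:
            raise ValueError("expected one action per agent")
        moves = {0: 0, 1: -1, 2: 1}  # 0: stay, 1: move left, 2: move right
        for i, action in enumerate(actions):
            new_pos = self.positions[i] + moves[action]
            self.positions[i] = min(self.size, max(0, new_pos))
        rewards = [1.0 if p == 0 else 0.0 for p in self.positions]
        done = all(p == 0 for p in self.positions)
        return list(self.positions), rewards, done, {}


# Heterogeneous policies: one per agent, each mapping its own observation to an action.
policies: List[Callable[[int], int]] = [
    lambda obs: 1,                   # always move left
    lambda obs: 1 if obs > 0 else 0  # move left until the goal, then stay
]

env = ToyMultiAgentEnv(n_agents=2, size=5, seed=0)
obs = env.reset()
done, steps = False, 0
while not done and steps < 20:
    actions = [policy(o) for policy, o in zip(policies, obs)]
    obs, rewards, done, info = env.step(actions)
    steps += 1
```

Returning lists indexed by agent keeps the single-agent loop structure intact while letting each agent be driven by a different policy or learner.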
Challenges for Reinforcement Learning in Healthcare
Riachi, Elsa, Mamdani, Muhammad, Fralick, Michael, Rudzicz, Frank
Many healthcare decisions involve navigating through a multitude of treatment options in a sequential and iterative manner to find an optimal treatment pathway with the goal of an optimal patient outcome. Such optimization problems may be amenable to reinforcement learning. A reinforcement learning agent could be trained to provide treatment recommendations for physicians, acting as a decision support tool. However, a number of difficulties arise when using RL beyond benchmark environments, such as specifying the reward function, choosing an appropriate state representation and evaluating the learned policy.
Variational quantum policies for reinforcement learning
Jerbi, Sofiene, Gyurik, Casper, Marshall, Simon, Briegel, Hans J., Dunjko, Vedran
Variational quantum circuits have recently gained popularity as quantum machine learning models. While considerable effort has been invested to train them in supervised and unsupervised learning settings, relatively little attention has been given to their potential use in reinforcement learning. In this work, we leverage the understanding of quantum policy gradient algorithms in a number of ways. First, we investigate how to construct and train reinforcement learning policies based on variational quantum circuits. We propose several designs for quantum policies, provide their learning algorithms, and test their performance on classical benchmarking environments. Second, we show the existence of task environments with a provable separation in performance between quantum learning agents and any polynomial-time classical learner, conditioned on the widely-believed classical hardness of the discrete logarithm problem. We also consider more natural settings, in which we show an empirical quantum advantage of our quantum policies over standard neural-network policies. Our results constitute a first step towards establishing a practical near-term quantum advantage in a reinforcement learning setting. Additionally, we believe that some of our design choices for variational quantum policies may also be beneficial to other models based on variational quantum circuits, such as quantum classifiers and quantum regression models.
Decentralized Circle Formation Control for Fish-like Robots in the Real-world via Reinforcement Learning
Zhang, Tianhao, Li, Yueheng, Li, Shuai, Ye, Qiwei, Wang, Chen, Xie, Guangming
In this paper, the circle formation control problem is addressed for a group of cooperative underactuated fish-like robots involving unknown nonlinear dynamics and disturbances. Based on reinforcement learning and cognitive consistency theory, we propose a decentralized controller that requires no knowledge of the dynamics of the fish-like robots. The proposed controller can be transferred from simulation to reality: it is trained only in our established simulation environment, and the trained controller can be deployed to real robots without any manual tuning. Simulation results confirm that the proposed model-free robust formation control method is scalable with respect to the group size of the robots and outperforms other representative RL algorithms. Several experiments in the real world verify the effectiveness of our RL-based approach for circle formation control.
The Societal Implications of Deep Reinforcement Learning
Whittlestone, Jess, Arulkumaran, Kai, Crosby, Matthew (Imperial College London)
Deep Reinforcement Learning (DRL) is an avenue of research in Artificial Intelligence (AI) that has received increasing attention within the research community in recent years, and is beginning to show potential for real-world application. DRL is one of the most promising routes towards developing more autonomous AI systems that interact with and take actions in complex real-world environments, and can more flexibly solve a range of problems for which we may not be able to precisely specify a correct ‘answer’. This could have substantial implications for people’s lives: for example by speeding up automation in various sectors, changing the nature and potential harms of online influence, or introducing new safety risks in physical infrastructure. In this paper, we review recent progress in DRL, discuss how this may introduce novel and pressing issues for society, ethics, and governance, and highlight important avenues for future research to better understand DRL’s societal implications. This article appears in the special track on AI and Society.
Provably Efficient Cooperative Multi-Agent Reinforcement Learning with Function Approximation
Dubey, Abhimanyu, Pentland, Alex
Cooperative multi-agent reinforcement learning (MARL) systems are prevalent in many engineering domains, e.g., robotic systems (Ding et al., 2020), power grids (Yu et al., 2014), traffic control (Bazzan, 2009), as well as team games (Zhao et al., 2019). Increasingly, federated (Yang et al., 2019) and distributed (Peteiro-Barral & Guijarro-Berdiñas, 2013) machine learning is gaining prominence in industrial applications, and reinforcement learning in these large-scale settings is becoming important in the research community as well (Zhuo et al., 2019; Liu et al., 2019). Recent research in the statistical learning community has focused on cooperative multi-agent decision-making algorithms with provable guarantees (Zhang et al., 2018b; Wai et al., 2018; Zhang et al., 2018a). However, prior work focuses on algorithms that, while decentralized, provide guarantees on convergence (e.g., Zhang et al. (2018b)) but no finite-sample guarantees for regret, in contrast to efficient algorithms with function approximation proposed for single-agent RL (e.g., Jin et al. (2018, 2020); Yang et al. (2020)). Moreover, optimization in the decentralized multi-agent setting is also known to be non-convergent without assumptions (Tan, 1993). Developing no-regret multi-agent algorithms is therefore an important problem in RL. For the (relatively) easier problem of multi-agent multi-armed bandits, there has been significant recent interest in decentralized algorithms involving agents communicating over a network (Landgren et al., 2016a, 2018; Martínez-Rubio et al., 2019; Dubey & Pentland, 2020b), as well as in distributed settings (Hillel et al., 2013; Wang et al., 2019). Since several application areas for distributed sequential decision-making regularly involve non-stationarity and contextual information (Polydoros & Nalpantidis, 2017), an MDP formulation can potentially provide stronger algorithms for these settings as well.
Furthermore, no-regret algorithms in the single-agent RL setting with function approximation (e.g., Jin et al. (2020)) build on analysis techniques for contextual bandits, which leads us to the question: can no-regret function approximation be extended to (decentralized) cooperative multi-agent reinforcement learning?
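For readers unfamiliar with the term, one common way to formalize the regret in question is sketched below. The notation is an assumption rather than something taken from the text above: an episodic MDP, $M$ cooperating agents, $K$ episodes per agent, $x_{m,1}^{k}$ the initial state of agent $m$ in episode $k$, $\pi_{m}^{k}$ the policy it follows in that episode, and $V_1^{\star}$ the optimal value function at the first step.

```latex
\mathrm{Regret}(K)
  = \sum_{m=1}^{M} \sum_{k=1}^{K}
    \left( V_1^{\star}\!\left(x_{m,1}^{k}\right)
         - V_1^{\pi_{m}^{k}}\!\left(x_{m,1}^{k}\right) \right)
```

Under this definition, an algorithm is called no-regret when $\mathrm{Regret}(K)$ grows sublinearly in the total number of episodes $MK$, so that the average per-episode gap to the optimal policy vanishes.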
Computational Impact Time Guidance: A Learning-Based Prediction-Correction Approach
Liu, Zichao, Wang, Jiang, He, Shaoming, Shin, Hyo-Sang, Tsourdos, Antonios
This paper investigates the problem of impact-time control and proposes a learning-based computational guidance algorithm to solve it. The proposed guidance algorithm is developed based on a general prediction-correction concept: the exact time-to-go under proportional navigation guidance with realistic aerodynamic characteristics is estimated by a deep neural network, and a biased command to nullify the impact time error is developed using emerging reinforcement learning techniques. The deep neural network is augmented into the reinforcement learning block to resolve the issue of sparse reward that has been observed in typical reinforcement learning formulations. Extensive numerical simulations are conducted to support the proposed algorithm.
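The prediction step of such a prediction-correction scheme can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the straight-line estimate `range / speed` is a crude stand-in for the deep-network time-to-go predictor described in the abstract.

```python
def straight_line_time_to_go(range_m: float, speed_mps: float) -> float:
    """Crude stand-in predictor: time to cover the current range at constant speed.

    Assumption: the paper uses a deep neural network here; this placeholder
    ignores trajectory curvature and aerodynamic effects.
    """
    if speed_mps <= 0:
        raise ValueError("speed must be positive")
    return range_m / speed_mps


def impact_time_error(t_now: float, desired_impact_time: float, time_to_go: float) -> float:
    """Error between the desired impact time and the currently predicted one."""
    predicted_impact_time = t_now + time_to_go
    return desired_impact_time - predicted_impact_time


# Example: 10 s into the flight, 5000 m from the target at 250 m/s,
# with a desired impact time of 40 s.
t_go = straight_line_time_to_go(range_m=5000.0, speed_mps=250.0)
error = impact_time_error(t_now=10.0, desired_impact_time=40.0, time_to_go=t_go)
```

A positive error means the vehicle would arrive early, so the correction step (the biased command learned by reinforcement learning in the paper) has to lengthen the trajectory; because the error is available at every time step, it can also serve as a dense training signal.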
Domain-Robust Visual Imitation Learning with Mutual Information Constraints
Cetin, Edoardo, Celiktutan, Oya
Human beings are able to understand objectives and learn by simply observing others perform a task. Imitation learning methods aim to replicate such capabilities; however, they generally depend on access to a full set of optimal states and actions taken with the agent's actuators and from the agent's point of view. In this paper, we introduce a new algorithm, called Disentangling Generative Adversarial Imitation Learning (DisentanGAIL), with the purpose of bypassing such constraints. Our algorithm enables autonomous agents to learn directly from high-dimensional observations of an expert performing a task, by making use of adversarial learning with a latent representation inside the discriminator network. This latent representation is regularized through mutual information constraints to incentivize learning only features that encode information about the completion levels of the task being demonstrated. This yields a shared feature space in which imitation can succeed while disregarding the differences between the expert's and the agent's domains. Empirically, our algorithm is able to efficiently imitate in a diverse range of control problems including balancing, manipulation and locomotion tasks, while being robust to various domain differences in terms of both environment appearance and agent embodiment.