Goto

Collaborating Authors

 Reinforcement Learning


Synthesizing Chemical Plant Operation Procedures using Knowledge, Dynamic Simulation and Deep Reinforcement Learning

arXiv.org Artificial Intelligence

Chemical plants are complex and dynamical systems consisting of many components for manipulation and sensing, whose state transitions depend on various factors such as time, disturbance, and operation procedures. For the purpose of supporting human operators of chemical plants, we are developing an AI system that can semi-automatically synthesize operation procedures for efficient and stable operation. Our system can provide not only appropriate operation procedures but also reasons why the procedures are considered to be valid. This is achieved by integrating automated reasoning and deep reinforcement learning technologies with a chemical plant simulator and external knowledge. Our preliminary experimental results demonstrate that it can synthesize a procedure that achieves a much faster recovery from a malfunction compared to standard PID control.


Safety-Guided Deep Reinforcement Learning via Online Gaussian Process Estimation

arXiv.org Artificial Intelligence

An important facet of reinforcement learning (RL) has to do with how the agent goes about exploring the environment. Traditional exploration strategies typically focus on efficiency and ignore safety. However, for practical applications, ensuring safety of the agent during exploration is crucial since performing an unsafe action or reaching an unsafe state could result in irreversible damage to the agent. The main challenge of safe exploration is that characterizing the unsafe states and actions is difficult for large continuous state or action spaces and unknown environments. In this paper, we propose a novel approach to incorporate estimations of safety to guide exploration and policy search in deep reinforcement learning. By using a cost function to capture trajectory-based safety, our key idea is to formulate the state-action value function of this safety cost as a candidate Lyapunov function and extend control-theoretic results to approximate its derivative using online Gaussian Process (GP) estimation. We show how to use these statistical models to guide the agent in unknown environments to obtain high-performance control policies with provable stability certificates.


Understanding the Artificial Intelligence Clinician and optimal treatment strategies for sepsis in intensive care

arXiv.org Artificial Intelligence

In this document, we explore in more detail our published work (Komorowski, Celi, Badawi, Gordon, & Faisal, 2018) for the benefit of the AI in Healthcare research community. In the above paper, we developed the AI Clinician system, which demonstrated how reinforcement learning could be used to make useful recommendations towards optimal treatment decisions from intensive care data. Since publication a number of authors have reviewed our work (e.g. Given the difference of our framework to previous work, the fact that we are bridging two very different academic communities (intensive care and machine learning) and that our work has impact on a number of other areas with more traditional computer-based approaches (biosignal processing and control, biomedical engineering), we are providing here additional details on our recent publication. We acknowledge the online comments by Jeter et al (https://arxiv.org/abs/1902.03271). The sections of the present document are structured so as to address some of their questions. For clarity, we label figures from our main Nature Medicine publication as "M", figures from Jeter et al.'s arXiv paper as "J" and figures from our response here as "R". Jeter et al. state "the only possible response we can afford is a more aggressive and open dialogue".


Can Sophisticated Dispatching Strategy Acquired by Reinforcement Learning? - A Case Study in Dynamic Courier Dispatching System

arXiv.org Artificial Intelligence

In this paper, we study a courier dispatching problem (CDP) raised from an online pickup-service platform of Alibaba. The CDP aims to assign a set of couriers to serve pickup requests with stochastic spatial and temporal arrival rate among urban regions. The objective is to maximize the revenue of served requests given a limited number of couriers over a period of time. Many online algorithms such as dynamic matching and vehicle routing strategy from existing literature could be applied to tackle this problem. However, these methods rely on appropriately predefined optimization objectives at each decision point, which is hard in dynamic situations. This paper formulates the CDP as a Markov decision process (MDP) and proposes a data-driven approach to derive the optimal dispatching rule-set under different scenarios. Our method stacks multi-layer images of the spatial-and-temporal map and apply multi-agent reinforcement learning (MARL) techniques to evolve dispatching models. This method solves the learning inefficiency caused by traditional centralized MDP modeling. Through comprehensive experiments on both artificial dataset and real-world dataset, we show: 1) By utilizing historical data and considering long-term revenue gains, MARL achieves better performance than myopic online algorithms; 2) MARL is able to construct the mapping between complex scenarios to sophisticated decisions such as the dispatching rule. 3) MARL has the scalability to adopt in large-scale real-world scenarios.


Using World Models for Pseudo-Rehearsal in Continual Learning

arXiv.org Machine Learning

The utility of learning a dynamics/world model of the environment in reinforcement learning has been shown in a many ways. When using neural networks, however, these models suffer catastrophic forgetting when learned in a lifelong or continual fashion. Current solutions to the continual learning problem require experience to be segmented and labeled as discrete tasks, however, in continuous experience it is generally unclear what a sufficient segmentation of tasks would be. Here we propose a method to continually learn these internal world models through the interleaving of internally generated rollouts from past experiences (i.e., pseudo-rehearsal). We show this method can sequentially learn unsupervised temporal prediction, without task labels, in a disparate set of Atari games. Empirically, this interleaving of the internally generated rollouts with the external environment's observations leads to an average 4.5x reduction in temporal prediction loss compared to non-interleaved learning. Similarly, we show that the representations of this internal model remain stable across learned environments. Here, an agent trained using an initial version of the internal model can perform equally well when using a subsequent version that has successfully incorporated experience from multiple new environments.


Super Mario Bros. Reinforcement Learning โ€“ Data Driven Investor โ€“ Medium

#artificialintelligence

I am sure you have played Super Mario Bros. If you played it, you already know how much hard it is. What if I tell you computer can solve the game without describing all rules of the game? I will show you a new exciting application of Reinforcement Learning method on it. The first level of the game will be learned by the computer without our intelligence, it will learn by itself.


Deep Active Localization

arXiv.org Artificial Intelligence

Active localization is the problem of generating robot actions that allow it to maximally disambiguate its pose within a reference map. Traditional approaches to this use an information-theoretic criterion for action selection and hand-crafted perceptual models. In this work we propose an end-to-end differentiable method for learning to take informative actions that is trainable entirely in simulation and then transferable to real robot hardware with zero refinement. The system is composed of two modules: a convolutional neural network for perception, and a deep reinforcement learned planning module. We introduce a multi-scale approach to the learned perceptual model since the accuracy needed to perform action selection with reinforcement learning is much less than the accuracy needed for robot control. We demonstrate that the resulting system outperforms using the traditional approach for either perception or planning. We also demonstrate our approaches robustness to different map configurations and other nuisance parameters through the use of domain randomization in training. The code is also compatible with the OpenAI gym framework, as well as the Gazebo simulator.


Using Natural Language for Reward Shaping in Reinforcement Learning

arXiv.org Artificial Intelligence

Recent reinforcement learning (RL) approaches have shown strong performance in complex domains such as Atari games, but are often highly sample inefficient. A common approach to reduce interaction time with the environment is to use reward shaping, which involves carefully designing reward functions that provide the agent intermediate rewards for progress towards the goal. However, designing appropriate shaping rewards is known to be difficult as well as time-consuming. In this work, we address this problem by using natural language instructions to perform reward shaping. We propose the LanguagE-Action Reward Network (LEARN), a framework that maps free-form natural language instructions to intermediate rewards based on actions taken by the agent. These intermediate language-based rewards can seamlessly be integrated into any standard reinforcement learning algorithm. We experiment with Montezuma's Revenge from the Atari Learning Environment, a popular benchmark in RL. Our experiments on a diverse set of 15 tasks demonstrate that, for the same number of interactions with the environment, language-based rewards lead to successful completion of the task 60% more often on average, compared to learning without language.


Model-Based Reinforcement Learning for Atari

arXiv.org Machine Learning

Model-free reinforcement learning (RL) can be used to learn effective policies for complex tasks, such as Atari games, even from image observations. However, this typically requires very large amounts of interaction -- substantially more, in fact, than a human would need to learn the same games. How can people learn so quickly? Part of the answer may be that people can learn how the game works and predict which actions will lead to desirable outcomes. In this paper, we explore how video prediction models can similarly enable agents to solve Atari games with orders of magnitude fewer interactions than model-free methods. We describe Simulated Policy Learning (SimPLe), a complete model-based deep RL algorithm based on video prediction models and present a comparison of several model architectures, including a novel architecture that yields the best results in our setting. Our experiments evaluate SimPLe on a range of Atari games and achieve competitive results with only 100K interactions between the agent and the environment (400K frames), which corresponds to about two hours of real-time play.


Reinforcement Learning System Automatically Trains Prosthetic Legs

#artificialintelligence

Powered leg prostheses can give amputees the ability to walk for long periods of time and to do so briskly. The reality is that these devices are pretty clunky and programming them to operate smoothly to produce a natural gait takes hours, and the results are rarely perfect. Scientists at North Carolina State University, the University of North Carolina, and Arizona State University have now developed an automated tuning system that relies purely on a technique called reinforcement learning to do its job. This allows the patient to simply walk with a new prosthetic on a treadmill while guided by a therapist. The system monitors the intentions of the patient and movement of the prosthetic, and adjusts in real time based on the readings.