Reinforcement Learning
Decoupled Data Based Approach for Learning to Control Nonlinear Dynamical Systems
Wang, Ran, Parunandi, Karthikeya, Yu, Dan, Kalathil, Dileep, Chakravorty, Suman
This paper addresses the problem of learning the optimal control policy for a nonlinear stochastic dynamical system with continuous state space, continuous action space and unknown dynamics. This class of problems is typically addressed in the stochastic adaptive control and reinforcement learning literature using model-based and model-free approaches, respectively. Both methods rely on solving a dynamic programming problem, either directly or indirectly, to find the optimal closed-loop control policy. The inherent 'curse of dimensionality' associated with dynamic programming makes these approaches computationally difficult. This paper proposes a novel decoupled data-based control (D2C) algorithm that addresses this problem using a decoupled 'open loop - closed loop' approach. First, an open-loop deterministic trajectory optimization problem is solved using a black-box simulation model of the dynamical system. Then, a closed-loop control is developed around this open-loop trajectory by linearizing the dynamics about this nominal trajectory. By virtue of linearization, a linear quadratic regulator based algorithm can be used for this closed-loop control. We show that the performance of the D2C algorithm is approximately optimal. Moreover, simulation results suggest a significant reduction in training time compared to other state-of-the-art algorithms.
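As a rough illustration of the closed-loop step described in this abstract, the sketch below runs the standard backward Riccati recursion for a finite-horizon linear quadratic regulator on a scalar, time-varying linear system. It is not the authors' implementation: the scalar setting, the function names tv_lqr_gains and closed_loop_control, and all coefficient values are assumptions made for illustration, and the coefficients a[t] and b[t] stand in for a linearization about the nominal trajectory that would have to be obtained separately.

```python
def tv_lqr_gains(a, b, q, r, q_final):
    """Feedback gains for x[t+1] = a[t]*x[t] + b[t]*u[t] (scalar case).

    Minimizes sum_t (q*x[t]**2 + r*u[t]**2) + q_final*x[T]**2 by the
    backward Riccati recursion. Inputs are hypothetical linearization
    coefficients; they are not taken from the paper.
    """
    horizon = len(a)
    p = q_final                      # cost-to-go weight at the final time
    gains = [0.0] * horizon
    for t in reversed(range(horizon)):
        denom = r + b[t] * p * b[t]
        gains[t] = b[t] * p * a[t] / denom
        # Riccati update of the cost-to-go weight
        p = q + a[t] * p * a[t] - (a[t] * p * b[t]) ** 2 / denom
    return gains


def closed_loop_control(x, x_nom, u_nom, gain):
    """Nominal (open-loop) input plus linear feedback on the deviation."""
    return u_nom - gain * (x - x_nom)


# Two-step example with made-up coefficients.
gains = tv_lqr_gains(a=[1.0, 1.0], b=[1.0, 1.0], q=1.0, r=1.0, q_final=1.0)
u0 = closed_loop_control(x=1.2, x_nom=1.0, u_nom=0.3, gain=gains[0])
```

The open-loop input u_nom is applied unchanged when the state sits exactly on the nominal trajectory; the feedback term only acts on deviations from it, which is the sense in which the two design steps are decoupled.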
A Game Theoretical Framework for the Evaluation of Unmanned Aircraft Systems Airspace Integration Concepts
Predicting the outcomes of integrating Unmanned Aerial Systems (UAS) into the National Airspace System (NAS) is a complex problem that must be addressed through simulation studies before UAS are allowed routine access to the NAS. This thesis focuses on providing 2D and 3D simulation frameworks that use a game theoretical methodology to evaluate integration concepts in scenarios where manned and unmanned air vehicles co-exist. The fundamental gap in the literature is that the models of interaction between manned and unmanned vehicles are insufficient: a) they assume that pilot behavior is known a priori and b) they disregard decision making processes. To fill this gap, this work proposes a modeling framework in which human pilot reactions are modeled using reinforcement learning and a game theoretical concept called level-k reasoning. The level-k reasoning concept is based on the assumption that humans have various levels of decision making. Reinforcement learning is a mathematical learning method that is rooted in human learning. In this work, a classical reinforcement learning method and an approximate one (Neural Fitted Q Iteration) are used to model time-extended decisions of pilots with 2D and 3D maneuvers. An analysis of UAS integration is conducted using example scenarios in the presence of manned aircraft and fully autonomous UAS equipped with sense and avoid algorithms.
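To make the level-k idea concrete, here is a minimal sketch under simplifying assumptions. In the usual formulation of level-k reasoning, a level-0 agent follows a fixed, non-strategic rule, and a level-k agent best-responds to an opponent assumed to be level-(k-1). The thesis computes such responses with reinforcement learning over time-extended decisions; the sketch swaps that for an exact best response in a one-shot, symmetric two-action game. The payoff numbers, the action labels and the function names are hypothetical.

```python
KEEP_COURSE, DEVIATE = 0, 1

# PAYOFF[my_action][other_action]: made-up values for a symmetric encounter.
PAYOFF = [
    [-10, 1],   # I keep course: conflict if the other does too
    [0, -1],    # I deviate: small cost, larger if both deviate
]


def best_response(payoff, other_action):
    """Action with the highest payoff against a known opponent action."""
    return max(range(len(payoff)), key=lambda a: payoff[a][other_action])


def level_k_action(k, payoff, level0_action):
    """Action of a level-k agent, built up from the level-0 rule."""
    action = level0_action
    for _ in range(k):
        # Each level best-responds to the level directly below it.
        action = best_response(payoff, action)
    return action


actions_by_level = [level_k_action(k, PAYOFF, KEEP_COURSE) for k in range(3)]
```

With these payoffs the level-0 agent keeps course, the level-1 agent deviates in response, and the level-2 agent keeps course again because it expects the other side to deviate.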
Towards Robust Deep Reinforcement Learning for Traffic Signal Control: Demand Surges, Incidents and Sensor Failures
Rodrigues, Filipe, Azevedo, Carlos Lima
Reinforcement learning (RL) constitutes a promising solution for alleviating the problem of traffic congestion. In particular, deep RL algorithms have been shown to produce adaptive traffic signal controllers that outperform conventional systems. However, in order to be reliable in highly dynamic urban areas, such controllers need to be robust with respect to a series of exogenous sources of uncertainty. In this paper, we develop an open-source callback-based framework for promoting the flexible evaluation of different deep RL configurations under a traffic simulation environment. With this framework, we demonstrate how deep RL-based adaptive traffic controllers perform under different scenarios, namely under demand surges caused by special events, capacity reductions from incidents and sensor failures. We extract several key insights for the development of robust deep RL algorithms for traffic control and propose concrete designs to mitigate the impact of the considered exogenous uncertainties.
Rogue-Gym: A New Challenge for Generalization in Reinforcement Learning
Kanagawa, Yuji, Kaneko, Tomoyuki
This paper presents Rogue-Gym, which enables agents to learn and play a subset of the original Rogue game with the OpenAI Gym interface. In roguelike games, a player explores a dungeon where each floor is a two-dimensional grid maze with enemies, gold, and downstairs. Because the map of a dungeon is different each time an agent starts a new game, learning in Rogue-Gym inevitably involves generalizing from experience in a highly abstract manner. We argue that this generalization in reinforcement learning is a big challenge for AI agents. Recently, deep reinforcement learning (DRL) has succeeded in many games. However, it has been pointed out that agents trained by DRL methods often overfit to the training environment. To investigate this problem, some research environments with procedural content generation have been proposed. Following these studies, we show that Rogue-Gym poses a new generalization problem for learned policies. In our experiments, we evaluate a standard reinforcement learning method, PPO, with and without enhancements for generalization. The results show that some enhancements work effectively, but that there is still much room for improvement. Therefore, Rogue-Gym is a new challenging domain for further studies.
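The generalization setting described in this abstract can be sketched with seeded procedural generation: the agent trains on maps from one set of seeds and is evaluated on maps from a disjoint set. The code below is only an illustration of that protocol and does not use the Rogue-Gym API; generate_floor, the floor layout and the seed ranges are all hypothetical.

```python
import random


def generate_floor(seed, width=8, height=6, n_gold=3):
    """Place gold and a downstairs tile on a grid, determined by the seed."""
    rng = random.Random(seed)        # local generator: same seed, same floor
    cells = [(x, y) for x in range(width) for y in range(height)]
    picked = rng.sample(cells, n_gold + 1)   # distinct cells, no overlap
    return {"gold": sorted(picked[:n_gold]), "downstairs": picked[n_gold]}


# Disjoint seed sets: floors seen in training never appear in evaluation.
train_seeds = range(0, 10)
eval_seeds = range(1000, 1010)

train_floors = [generate_floor(s) for s in train_seeds]
eval_floors = [generate_floor(s) for s in eval_seeds]
```

A policy that scores well on train_floors but poorly on eval_floors has memorized particular maps rather than learned behavior that transfers, which is the overfitting effect the abstract refers to.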
Robust Exploration with Tight Bayesian Plausibility Sets
Russel, Reazul H., Gu, Tianyi, Petrik, Marek
Optimism about poorly understood states and actions is the main driving force of exploration for many provably efficient reinforcement learning algorithms. We propose optimism in the face of sensible value functions (OFVF), a novel data-driven Bayesian algorithm for constructing plausibility sets for MDPs in order to explore robustly while minimizing the worst-case exploration cost. The method computes policies with tighter optimistic estimates for exploration by introducing two new ideas. First, it is based on Bayesian posterior distributions rather than distribution-free bounds. Second, OFVF does not construct plausibility sets as simple confidence intervals. Using confidence intervals as plausibility sets is a sufficient but not a necessary condition. OFVF uses the structure of the value function to optimize the location and shape of the plausibility set to guarantee upper bounds directly, without necessarily requiring the set to be a confidence interval. OFVF proceeds in an episodic manner, where the duration of the episode is fixed and known. Our algorithm is inherently Bayesian and can leverage prior information. Our theoretical analysis shows the robustness of OFVF, and the empirical results demonstrate its practical promise.
Off-Policy Policy Gradient with State Distribution Correction
Liu, Yao, Swaminathan, Adith, Agarwal, Alekh, Brunskill, Emma
The ability to use data about prior decisions and their outcomes to make counterfactual inferences about how alternative decision policies might perform is a cornerstone of intelligent behavior. It also has immense practical potential: it can enable the use of electronic medical record data to infer better treatment decisions for patients, the use of prior product recommendations to inform more effective strategies for presenting recommendations, and the use of previously collected data from students using educational software to better teach those and future students. Such counterfactual reasoning, particularly when one is deriving decision policies that will be used to make not one but a sequence of decisions, is important since online sampling during a learning procedure is both costly and dangerous, and not practical in many of the applications above. While amply motivated, such counterfactual reasoning is also challenging because the data is censored: we can only observe the result of providing a particular chemotherapy treatment policy to a particular patient, not the counterfactual outcome had we instead started with a radiation sequence. We focus on the problem of performing such counterfactual inferences in the context of sequential decision making in a Markov decision process (MDP).
A Survey on Traffic Signal Control Methods
Wei, Hua, Zheng, Guanjie, Gayah, Vikash, Li, Zhenhui
Traffic congestion is a growing problem that continues to plague urban areas, with negative outcomes for both the traveling public and society as a whole. These negative outcomes will only grow over time as more people flock to urban areas. In 2014, traffic congestion cost Americans over $160 billion in lost productivity and wasted over 3.1 billion gallons of fuel [Economist 2014]. Over 56 billion pounds of harmful CO2 emissions were also attributed to traffic congestion in 2011 [Schrank et al. 2015]. In the European Union, the cost of traffic congestion was equivalent to 1% of the entire GDP [Schrank et al. 2012]. Mitigating congestion would have significant economic, environmental and societal benefits. Signalized intersections are one of the most prevalent bottleneck types in urban environments, and thus traffic signal control plays a vital role in urban traffic management.
End-to-End Robotic Reinforcement Learning without Reward Engineering
Singh, Avi, Yang, Larry, Hartikainen, Kristian, Finn, Chelsea, Levine, Sergey
The combination of deep neural network models and reinforcement learning algorithms can make it possible to learn policies for robotic behaviors that directly read in raw sensory inputs, such as camera images, effectively subsuming both estimation and control into one model. However, real-world applications of reinforcement learning must specify the goal of the task by means of a manually programmed reward function, which in practice requires either designing the very same perception pipeline that end-to-end reinforcement learning promises to avoid, or else instrumenting the environment with additional sensors to determine if the task has been performed successfully. In this paper, we propose an approach for removing the need for manual engineering of reward specifications by enabling a robot to learn from a modest number of examples of successful outcomes, followed by actively solicited queries, where the robot shows the user a state and asks for a label to determine whether that state represents successful completion of the task. While requesting labels for every single state would amount to asking the user to manually provide the reward signal, our method requires labels for only a tiny fraction of the states seen during training, making it an efficient and practical approach for learning skills without manually engineered rewards. We evaluate our method on real-world robotic manipulation tasks where the observations consist of images viewed by the robot's camera. In our experiments, our method effectively learns to arrange objects, place books, and drape cloth, directly from images and without any manually specified reward functions, and with only 1-4 hours of interaction with the real world.
Simion Zoo: A Workbench for Distributed Experimentation with Reinforcement Learning for Continuous Control Tasks
Fernandez-Gauna, Borja, Graña, Manuel, Zimmermann, Roland S.
In recent years, Reinforcement Learning (RL) has become a very popular area of research, because the almost exponential increase in computing power brought by dedicated GPUs has empowered researchers to face previously unaffordable problems. In particular, the successful applications of Deep Reinforcement Learning (DRL) to produce master videogame players [10, 7] have created great expectations about the potential of DRL, even outside the academic research community. As a result of this popularity boost, the number of RL software packages has grown significantly. Nevertheless, these projects are mostly oriented towards the research community, i.e. they assume sophisticated programming users with powerful computing resources to run the software. Even for sophisticated programmers, these packages impose a steep learning curve that hinders their user experience. This is in stark contrast with the de facto user standards for Supervised Learning (SL) software, which customarily allows users to design and run experiments and to analyze the results in an intuitive Graphical User Interface (GUI) that can be learned quickly. Users without programming skills who intend to design and run RL experiments quickly on inexpensive and commonly available hardware will appreciate such facilities.
Learning 3D Navigation Protocols on Touch Interfaces with Cooperative Multi-Agent Reinforcement Learning
Debard, Quentin, Dibangoye, Jilles Steeve, Canu, Stéphane, Wolf, Christian
Using touch devices to navigate in virtual 3D environments such as computer-assisted design (CAD) models or geographical information systems (GIS) is inherently difficult for humans, as the 3D operations have to be performed by the user on a 2D touch surface. This ill-posed problem is classically solved with a fixed and handcrafted interaction protocol, which must be learned by the user. We propose to use reinforcement learning (RL) to automatically learn a new interaction protocol that maps a 2D user input to 3D actions in virtual environments. A fundamental problem of RL methods is the vast amount of interactions often required, which are difficult to come by when humans are involved. To overcome this limitation, we make use of two collaborative agents. The first agent models the human by learning to perform the 2D finger trajectories. The second agent acts as the interaction protocol, interpreting the 2D finger trajectories from the first agent and translating them into 3D operations. We restrict the learned 2D trajectories to be similar to a training set of collected human gestures by first performing state representation learning, prior to reinforcement learning. This state representation learning is addressed by projecting the gestures into a latent space learned by a variational auto encoder (VAE).