Reinforcement Learning
Approximate Inverse Reinforcement Learning from Vision-based Imitation Learning
Lee, Keuntaek, Vlahov, Bogdan, Gibson, Jason, Rehg, James M., Theodorou, Evangelos A.
In this work, we present a method for obtaining an implicit objective function for vision-based navigation. The proposed methodology relies on Imitation Learning, Model Predictive Control (MPC), and Deep Learning. We use Imitation Learning as a means to do Inverse Reinforcement Learning in order to create an approximate costmap generator for a visual navigation challenge. The resulting costmap is used in conjunction with a Model Predictive Controller for real-time control and outperforms other state-of-the-art costmap generators combined with MPC in novel environments. The proposed process allows for simple training and robustness to out-of-sample data. We apply our method to the task of vision-based autonomous driving in multiple real and simulated environments using the same weights for the costmap predictor in all environments.
Regret Balancing for Bandit and RL Model Selection
Abbasi-Yadkori, Yasin, Pacchiano, Aldo, Phan, My
We consider model selection in stochastic bandit and reinforcement learning problems. Given a set of base learning algorithms, an effective model selection strategy adapts to the best learning algorithm in an online fashion. We show that by estimating the regret of each algorithm and playing the algorithms such that all empirical regrets are ensured to be of the same order, the overall regret balancing strategy achieves a regret that is close to the regret of the optimal base algorithm. Our strategy requires an upper bound on the optimal base regret as input, and the performance of the strategy depends on the tightness of the upper bound. We show that having this prior knowledge is necessary in order to achieve a near-optimal regret. Further, we show that any near-optimal model selection strategy implicitly performs a form of regret balancing.
Real-Time Model Calibration with Deep Reinforcement Learning
Tian, Yuan, Chao, Manuel Arias, Kulkarni, Chetan, Goebel, Kai, Fink, Olga
The dynamic, real-time, and accurate inference of model parameters from empirical data is of great importance in many scientific and engineering disciplines that use computational models (such as a digital twin) for the analysis and prediction of complex physical processes. However, fast and accurate inference for processes with large and high dimensional datasets cannot easily be achieved with state-of-the-art methods under noisy real-world conditions. The primary reason is that the inference of model parameters with traditional techniques based on optimisation or sampling often suffers from computational and statistical challenges, resulting in a trade-off between accuracy and deployment time. In this paper, we propose a novel framework for inference of model parameters based on reinforcement learning. The contribution of the paper is twofold: 1) We reformulate the inference problem as a tracking problem with the objective of learning a policy that forces the response of the physics-based model to follow the observations; 2) We propose the constrained Lyapunov-based actor-critic (CLAC) algorithm to enable the robust and accurate inference of physics-based model parameters in real time under noisy real-world conditions. The proposed methodology is demonstrated and evaluated on two model-based diagnostics test cases utilizing two different physics-based models of turbofan engines. The performance of the methodology is compared to that of two alternative approaches: a state update method (unscented Kalman filter) and a supervised end-to-end mapping with deep neural networks. The experimental results demonstrate that the proposed methodology outperforms all other tested methods in terms of speed and robustness, with high inference accuracy.
Mirror Descent Policy Optimization
Tomar, Manan, Shani, Lior, Efroni, Yonathan, Ghavamzadeh, Mohammad
We propose deep Reinforcement Learning (RL) algorithms inspired by mirror descent, a well-known first-order trust region optimization method for solving constrained convex problems. Our approach, which we call as Mirror Descent Policy Optimization (MDPO), is based on the idea of iteratively solving a `trust-region' problem that minimizes a sum of two terms: a linearization of the objective function and a proximity term that restricts two consecutive updates to be close to each other. Following this approach we derive on-policy and off-policy variants of the MDPO algorithm and analyze their performance while emphasizing important implementation details, motivated by the existing theoretical framework. We highlight the connections between on-policy MDPO and two popular trust region RL algorithms: TRPO and PPO, and conduct a comprehensive empirical comparison of these algorithms. We then derive off-policy MDPO and compare its performance to existing approaches. Importantly, we show that the theoretical framework of MDPO can be scaled to deep RL while achieving good performance on popular benchmarks.
Fitted Q-Learning for Relational Domains
Das, Srijita, Natarajan, Sriraam, Roy, Kaushik, Parr, Ronald, Kersting, Kristian
We take two specific approaches - first Value function approximation in Reinforcement Learning is to represent the lifted Q-value functions and the second (RL) has long been viewed using the lens of feature discovery is to represent the Bellman residuals - both using a set of (Parr et al. 2007). A set of classical approaches relational regression trees (RRTs) (Blockeel and De Raedt for this problem based on Approximate Dynamic Programming 1998). A key aspect of our approach is that it is model-free, (ADP) is the fitted value iteration algorithm (Boyan which most of the RMDP algorithms assume. The only exception and Moore 1995; Ernst, Geurts, and Wehenkel 2005; Riedmiller is Fern et al. (2006), who directly learn in policy 2005), a batch mode approximation scheme that employs space. Our work differs from their work in that we directly function approximators in each iteration to represent learn value functions and eventually policies from them the value estimates. Another popular class of methods that and adapt the most recently successful relational gradientboosting address this problem is Bellman error based methods (Menache, (RFGB) (Natarajan et al. 2014), which has been Mannor, and Shimkin 2005; Keller, Mannor, and Precup shown to outperform learning relational rules one by one.
Policy Gradient from Demonstration and Curiosity
With reinforcement learning, an agent could learn complex behaviors from highlevel abstractions of the task. However, exploration and reward shaping remained challenging for existing methods, especially in scenarios where the extrinsic feedback was sparse. Expert demonstrations have been investigated to solve these difficulties, but a tremendous number of high-quality demonstrations were usually required. In this work, an integrated policy gradient algorithm was proposed to boost exploration and facilitate intrinsic reward learning from only limited number of demonstrations. We achieved this by reformulating the original reward function with two additional terms, where the first term measured the Jensen-Shannon divergence between current policy and the expert's demonstrations, and the second term estimated the agent's uncertainty about the environment. The presented algorithm was evaluated by a range of simulated tasks with sparse extrinsic reward signals, where only one single demonstrated trajectory was provided to each task. Superior exploration efficiency and high average return were demonstrated in all tasks. Furthermore, it was found that the agent could imitate the expert's behavior and meanwhile sustain high return.
Constrained episodic reinforcement learning in concave-convex and knapsack settings
Brantley, Kianté, Dudik, Miroslav, Lykouris, Thodoris, Miryoosefi, Sobhan, Simchowitz, Max, Slivkins, Aleksandrs, Sun, Wen
Standard reinforcement learning (RL) approaches seek to maximize a scalar reward (Sutton and Barto, 1998, 2018; Schulman et al., 2015; Mnih et al., 2015), but in many settings this is insufficient, because the desired properties of the agent behavior are better described using constraints. For example, an autonomous vehicle should not only get to the destination, but should also respect safety, fuel efficiency, and human comfort constraints along the way (Le et al., 2019); a robot should not only fulfill its task, but should also control its wear and tear, for example, by limiting the torque exerted on its motors (Tessler et al., 2019). Moreover, in many settings, we wish to satisfy such constraints already during training and not only during the deployment. For example, a power grid, an autonomous vehicle, or a real robotic hardware should avoid costly failures, where the hardware is damaged or humans are harmed, already during training (Leike et al., 2017; Ray et al., 2020). Constraints are also key in additional sequential decision making applications, such as dynamic pricing with limited supply, e.g., (Besbes and Zeevi, 2009; Babaioff et al., 2015), scheduling of resources on a computer cluster (Mao et al., 2016), and imitation learning, where the goal is to stay close to an expert behavior (Syed and Schapire, 2007; Ziebart et al., 2008; Sun et al., 2019).
Causal Discovery from Incomplete Data using An Encoder and Reinforcement Learning
Huang, Xiaoshui, Zhu, Fujin, Holloway, Lois, Haidar, Ali
Discovering causal structure among a set of variables is a fundamental problem in many domains. However, state-of-the-art methods seldom consider the possibility that the observational data has missing values (incomplete data), which is ubiquitous in many real-world situations. The missing value will significantly impair the performance and even make the causal discovery algorithms fail. In this paper, we propose an approach to discover causal structures from incomplete data by using a novel encoder and reinforcement learning (RL). The encoder is designed for missing data imputation as well as feature extraction. In particular, it learns to encode the currently available information (with missing values) into a robust feature representation which is then used to determine where to search the best graph. The encoder is integrated into a RL framework that can be optimized using the actor-critic algorithm. Our method takes the incomplete observational data as input and generates a causal structure graph. Experimental results on synthetic and real data demonstrate that our method can robustly generate causal structures from incomplete data. Compared with the direct combination of data imputation and causal discovery methods, our method performs generally better and can even obtain a performance gain as much as 43.2%.
Automated Design Space Exploration for optimised Deployment of DNN on Arm Cortex-A CPUs
de Prado, Miguel, Mundy, Andrew, Saeed, Rabia, Denna, Maurizio, Pazos, Nuria, Benini, Luca
The spread of deep learning on embedded devices has prompted the development of numerous methods to optimise the deployment of deep neural networks (DNN). Works have mainly focused on: i) efficient DNN architectures, ii) network optimisation techniques such as pruning and quantisation, iii) optimised algorithms to speed up the execution of the most computational intensive layers and, iv) dedicated hardware to accelerate the data flow and computation. However, there is a lack of research on the combination of these methods as the space of approaches becomes too large to test and obtain a globally optimised solution, which leads to suboptimal deployment in terms of latency, accuracy, and memory. In this work, we first detail and analyse the methods to improve the deployment of DNNs across the different levels of software optimisation. Building on this knowledge, we present an automated exploration framework to ease the deployment of DNNs for industrial applications by automatically exploring the design space and learning an optimised solution that speeds up the performance and reduces the memory on embedded CPU platforms. The framework relies on a Reinforcement Learning -based search that, combined with a deep learning inference framework, enables the deployment of DNN implementations to obtain empirical measurements on embedded AI applications. Thus, we present a set of results for state-of-the-art DNNs on a range of Arm Cortex-A CPU platforms achieving up to 4x improvement in performance and over 2x reduction in memory with negligible loss in accuracy with respect to the BLAS floating-point implementation.
Hallucinating Value: A Pitfall of Dyna-style Planning with Imperfect Environment Models
Jafferjee, Taher, Imani, Ehsan, Talvitie, Erin, White, Martha, Bowling, Micheal
Dyna-style reinforcement learning (RL) agents improve sample efficiency over model-free RL agents by updating the value function with simulated experience generated by an environment model. However, it is often difficult to learn accurate models of environment dynamics, and even small errors may result in failure of Dyna agents. In this paper, we investigate one type of model error: hallucinated states. These are states generated by the model, but that are not real states of the environment. We present the Hallucinated Value Hypothesis (HVH): updating values of real states towards values of hallucinated states results in misleading state-action values which adversely affect the control policy. We discuss and evaluate four Dyna variants; three which update real states toward simulated -- and therefore potentially hallucinated -- states and one which does not. The experimental results provide evidence for the HVH thus suggesting a fruitful direction toward developing Dyna algorithms robust to model error.