"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.
Eike Germann, Introduction to Reinforcement Learning: What is Reinforcement Learning? How is it different from the machine learning we're familiar with? I'll present some foundational ideas (Markov decision processes, policy iteration, value iteration, etc.) and talk about their limitations. What algorithms are currently used to address those limitations, and how do they do it? Based on these, I'll give a short overview of what RL is currently used for, from training a machine to play Space Invaders to robotic movement in the real world.
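As a rough illustration of one of the foundational ideas named in this talk abstract, here is a minimal sketch of value iteration on a tabular Markov decision process. The two-state MDP, the array layout (`P[a, s, s2]`, `R[s, a]`), and the function name `value_iteration` are assumptions made for this example and do not come from the talk.

```python
import numpy as np


def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Tabular value iteration.

    P[a, s, s2] is the probability of moving from state s to s2 under action a.
    R[s, a] is the expected immediate reward for taking action a in state s.
    Returns the (approximately) optimal state values and a greedy policy.
    """
    n_states = P.shape[1]
    V = np.zeros(n_states)
    for _ in range(max_iters):
        # Bellman optimality backup:
        # Q[s, a] = R[s, a] + gamma * sum_s2 P[a, s, s2] * V[s2]
        Q = R + gamma * np.einsum("asn,n->sa", P, V)
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    # Greedy policy with respect to the final value estimate.
    Q = R + gamma * np.einsum("asn,n->sa", P, V)
    return V, Q.argmax(axis=1)


# Hypothetical toy MDP: action 0 stays put, action 1 switches state.
# Only "stay in state 1" pays a reward of 1.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],  # action 1: switch
])
R = np.array([
    [0.0, 0.0],  # state 0: no reward for either action
    [1.0, 0.0],  # state 1: reward 1 for staying
])
V, policy = value_iteration(P, R, gamma=0.9)
```

In this toy problem the greedy policy switches out of state 0 and stays in state 1, and the values come out at roughly 9 and 10. The sketch needs the full transition table `P`, which is one of the limitations of these classical methods.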
A Google researcher has highlighted some of the hilarious ways that artificial intelligence (AI) software has 'cheated' to fulfil its purpose. A programme designed not to lose at Tetris completed its task by simply pausing the game, while a self-driving car simulator asked to keep cars 'fast and safe' did so by making them spin on the spot. An AI programmed to spot cancerous skin lesions learned to flag blemishes pictured next to a ruler, as they indicated humans were already concerned about them. Victoria Krakovna, of Google's DeepMind AI lab, asked her colleagues for examples of misbehaving AI to highlight an often overlooked danger of the technology. She said that the biggest threat posed by AI was not that they disobeyed us, but that they obeyed us in the wrong way.
It learns from interaction with an environment to achieve a goal, or, put simply, it learns from rewards and punishments. This kind of learning is inspired by behaviourist psychology. As far as I could find, the term was coined in the 1980s, when research studies were being conducted on animal behaviour, especially on how some newborn animals learn to stand, run, and survive in a given environment. The reward is survival as a result of learning, and the punishment can be compared to being eaten by others.
In contrast to the intense study of deep reinforcement learning (RL) in games and simulations, applying deep RL to real-world robots remains challenging, especially in high-risk scenarios. Though there has been some progress in RL-based control in realistic robotics [2, 3, 4, 5], most of these previous works do not specifically deal with safety concerns in the RL training process. For the majority of high-risk scenarios in the real world, deep RL still suffers from bottlenecks in both cost and safety. As an example, collisions are extremely dangerous for UAVs, while RL training requires thousands of collisions. Other works contribute to building simulation environments and bridging the gap between reality and simulation [4, 5]. However, building such simulation environments is arduous, not to mention that the gap cannot be fully closed. To address the safety issue in real-world RL training, we present the Intervention Aided Reinforcement Learning (IARL) framework. Intervention is commonly used in many real-world automatic control systems for safety assurance. It is also regarded as an important evaluation criterion for autonomous navigation systems, e.g. the disengagement ratio in autonomous driving.
To solve complex real-world problems with reinforcement learning, we cannot rely on manually specified reward functions. Instead, we can have humans communicate an objective to the agent directly. In this work, we combine two approaches to learning from human feedback: expert demonstrations and trajectory preferences. We train a deep neural network to model the reward function and use its predicted reward to train a DQN-based deep reinforcement learning agent on 9 Atari games. Our approach beats the imitation learning baseline in 7 games and achieves strictly superhuman performance on 2 games without using game rewards. Additionally, we investigate the goodness of fit of the reward model, present some reward hacking problems, and study the effects of noise in the human labels.
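To make the idea of fitting a reward model to trajectory preferences more concrete, here is a minimal sketch. It assumes a Bradley-Terry style preference likelihood and a linear reward over hand-made features, both chosen for illustration only; the abstract above does not specify the loss, and the paper uses a deep network and a DQN-based agent, neither of which is reproduced here. All function names and the synthetic data are hypothetical.

```python
import numpy as np


def preference_loss_and_grad(w, d, y):
    """Cross-entropy loss for pairwise segment preferences.

    d[i] is the summed feature vector of segment A minus that of segment B,
    so d[i] @ w is the difference in predicted return under a linear reward.
    y[i] is 1.0 if the labeller preferred segment A, else 0.0.
    """
    logits = d @ w
    p_a = 1.0 / (1.0 + np.exp(-logits))  # assumed Bradley-Terry P(A preferred)
    eps = 1e-12                          # avoids log(0)
    loss = -np.mean(y * np.log(p_a + eps) + (1 - y) * np.log(1 - p_a + eps))
    grad = d.T @ (p_a - y) / len(y)
    return loss, grad


def fit_reward_weights(d, y, lr=0.05, steps=500):
    """Plain gradient descent on the preference loss."""
    w = np.zeros(d.shape[1])
    for _ in range(steps):
        _, grad = preference_loss_and_grad(w, d, y)
        w -= lr * grad
    return w


def make_synthetic_preferences(true_w, n_pairs=500, horizon=10, seed=0):
    """Hypothetical data: random segments labelled by a simulated annotator."""
    rng = np.random.default_rng(seed)
    shape = (n_pairs, horizon, len(true_w))
    feats_a = rng.normal(size=shape).sum(axis=1)
    feats_b = rng.normal(size=shape).sum(axis=1)
    d = feats_a - feats_b
    p_a = 1.0 / (1.0 + np.exp(-(d @ true_w)))
    y = (rng.random(n_pairs) < p_a).astype(float)  # noisy labels
    return d, y


true_w = np.array([1.0, -2.0, 0.5])
d, y = make_synthetic_preferences(true_w)
w_hat = fit_reward_weights(d, y)
cosine = w_hat @ true_w / (np.linalg.norm(w_hat) * np.linalg.norm(true_w))
```

Preferences only identify the reward up to scale, so the sketch compares the direction of `w_hat` with `true_w` rather than the raw values. In a full pipeline the learned reward would replace the game reward when training the agent.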
Learning policies on data synthesized by models can in principle quench the thirst of reinforcement learning algorithms for large amounts of real experience, which is often costly to acquire. However, simulating plausible experience de novo is a hard problem for many complex environments, often resulting in biases for model-based policy evaluation and search. Instead of de novo synthesis of data, here we assume logged, real experience and model alternative outcomes of this experience under counterfactual actions, i.e. actions that were not actually taken. Based on this, we propose the Counterfactually-Guided Policy Search (CF-GPS) algorithm for learning policies in POMDPs from off-policy experience. CF-GPS can improve on vanilla model-based RL algorithms by making use of available logged data to de-bias model predictions. In contrast to off-policy algorithms based on Importance Sampling which re-weight data, CF-GPS leverages a model to explicitly consider alternative outcomes, allowing the algorithm to make better use of experience data. We find empirically that these advantages translate into improved policy evaluation and search results on a nontrivial grid-world task. Finally, we show that CF-GPS generalizes the previously proposed Guided Policy Search and that reparameterization-based algorithms such as Stochastic Value Gradient can be interpreted as counterfactual methods.
This example tries to illustrate the everyday human capacity to reason about alternate, counterfactual outcomes of past experience with the goal of "mining worlds that could have been" (Pearl & Mackenzie, 2018). Social psychologists theorize that such cognitive processes are beneficial for improving future decision making (Roese, 1997). In this paper we aim to leverage possible advantages of counterfactual reasoning for learning decision making in the reinforcement learning (RL) framework.
In spite of recent success, learning policies with standard, model-free RL algorithms can be notoriously data inefficient. This issue can in principle be addressed by learning policies on data synthesized from a model.
Policy gradient methods are very attractive in reinforcement learning due to their model-free nature and convergence guarantees. These methods, however, suffer from high variance in gradient estimation, resulting in poor sample efficiency. To mitigate this issue, a number of variance-reduction approaches have been proposed. Unfortunately, in challenging problems with delayed rewards, these approaches either bring a relatively modest improvement or reduce variance at the expense of introducing a bias and undermining convergence. The unbiased methods of gradient estimation, in general, only partially reduce variance, without eliminating it completely even in the limit of exact knowledge of the value functions and problem dynamics, as one might have wished. In this work we propose an unbiased method that does completely eliminate variance under some commonly encountered conditions. Of practical interest is the limit of deterministic dynamics and small policy stochasticity. In the case of a quadratic value function, as in linear quadratic Gaussian models, the policy randomness need not be small. We use such a model to analyze the performance of the proposed variance-elimination approach and compare it with standard variance-reduction methods. The core idea behind the approach is to use control variates at all future times down the trajectory. We present both model-based and model-free formulations.
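The variance problem described in this abstract can be seen in a very small setting. The sketch below is not the paper's method: it only shows the standard variance-reduction idea of subtracting a constant baseline (the simplest control variate) from the reward in a score-function gradient estimate. The one-step Gaussian policy, the quadratic reward, and the function name are assumptions made for this example.

```python
import numpy as np


def score_function_gradients(mu=0.0, sigma=1.0, n=100_000, baseline=0.0, seed=0):
    """Per-sample score-function estimates of d/dmu E[r(a)], a ~ N(mu, sigma^2).

    The reward r(a) = -(a - 3)^2 is a hypothetical one-step problem.
    Subtracting a constant baseline leaves the estimator unbiased because
    the score (a - mu) / sigma^2 has zero mean.
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(mu, sigma, size=n)
    reward = -(a - 3.0) ** 2
    score = (a - mu) / sigma**2
    return score * (reward - baseline)


# With mu = 0 and sigma = 1 the exact gradient is -2 * (mu - 3) = 6.
plain = score_function_gradients(baseline=0.0)

# Baseline set to the exact expected reward, -((mu - 3)^2 + sigma^2) = -10.
expected_reward = -((0.0 - 3.0) ** 2 + 1.0**2)
with_baseline = score_function_gradients(baseline=expected_reward)

print("mean without baseline:", plain.mean(), "variance:", plain.var())
print("mean with baseline:   ", with_baseline.mean(), "variance:", with_baseline.var())
```

Both estimators average to roughly the same gradient, but the baseline version has noticeably lower variance. It does not remove the variance entirely, which matches the abstract's remark that standard unbiased methods only partially reduce it.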
Robustness is important for sequential decision making in a stochastic dynamic environment with uncertain probabilistic parameters. We address the problem of using robust MDPs (RMDPs) to compute policies with provable worst-case guarantees in reinforcement learning. The quality and robustness of an RMDP solution are determined by its ambiguity set. Existing methods construct ambiguity sets that lead to impractically conservative solutions. In this paper, we propose RSVF, which achieves less conservative solutions with the same worst-case guarantees by 1) leveraging a Bayesian prior, 2) optimizing the size and location of the ambiguity set, and, most importantly, 3) relaxing the requirement that the set is a confidence interval. Our theoretical analysis shows the safety of RSVF, and the empirical results demonstrate its practical promise.
While current benchmark reinforcement learning (RL) tasks have been useful to drive progress in the field, they are in many ways poor substitutes for learning with real-world data. By testing increasingly complex RL algorithms on low-complexity simulation environments, we often end up with brittle RL policies that generalize poorly beyond the very specific domain. To combat this, we propose three new families of benchmark RL domains that contain some of the complexity of the natural world, while still supporting fast and extensive data acquisition. The proposed domains also permit a characterization of generalization through fair train/test separation, and easy comparison and replication of results. Through this work, we challenge the RL research community to develop more robust algorithms that meet high standards of evaluation.
Abstract--One less-addressed issue of deep reinforcement learning is the lack of generalization capability to new states and new targets. For complex tasks, it is necessary to give the correct strategy and evaluate all possible actions for the current state. Fortunately, deep reinforcement learning has enabled enormous progress in both subproblems: giving the correct strategy and evaluating all actions based on the state. In this paper we present an approach called orthogonal policy gradient descent (OPGD) that lets the agent learn the policy gradient based on the current state and the action set, by which the agent can learn a policy network with generalization capability. The framework of the proposed method is used to implement autonomous driving. In this paper we propose a deep reinforcement learning (DRL) method called orthogonal policy gradient descent, for which it is proved that the global optimization objective function can reach its maximum value, and which is applied to autonomous driving.