 Reinforcement Learning


Recovery RL: Safe Reinforcement Learning with Learned Recovery Zones

arXiv.org Artificial Intelligence

Abstract: Safety remains a central obstacle preventing widespread use of RL in the real world: learning new tasks in uncertain environments requires extensive exploration, but safety requires limiting exploration. We propose Recovery RL, an algorithm which navigates this tradeoff by (1) leveraging offline data to learn about constraint violating zones before policy learning and (2) separating the goals of improving task performance and constraint satisfaction across two policies: a task policy that only optimizes the task reward and a recovery policy that guides the agent to safety when constraint violation is likely. We evaluate Recovery RL on 6 simulation domains, including two contact-rich manipulation tasks and an image-based navigation task, and an image-based obstacle avoidance task on a physical robot. We compare Recovery RL to 5 prior safe RL methods which jointly optimize for task performance and safety via constrained optimization or reward shaping and find that Recovery RL outperforms the next best prior method across all domains. Results suggest that Recovery RL trades off constraint violations and task successes 2-80 times more efficiently in simulation domains and 3 times more efficiently in physical experiments. Figure 1: Recovery RL can safely learn policies for contact-rich tasks from high-dimensional image observations in simulation experiments. For example, consider an agent tasked with learning to extract a carton of milk from a fridge.
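The division of labour between the two policies can be sketched as follows. This is a minimal illustration, not the authors' implementation: the risk estimator `risk_of`, the threshold `risk_threshold`, and the toy one-dimensional policies are all hypothetical names and values introduced here.

```python
from typing import Callable

State = float
Action = float


def select_action(
    state: State,
    task_policy: Callable[[State], Action],
    recovery_policy: Callable[[State], Action],
    risk_of: Callable[[State, Action], float],
    risk_threshold: float = 0.5,  # hypothetical value, not from the paper
) -> Action:
    """Use the task policy unless its proposed action looks too risky."""
    proposed = task_policy(state)
    if risk_of(state, proposed) > risk_threshold:
        # Constraint violation is judged likely: hand control to recovery.
        return recovery_policy(state)
    return proposed


# Toy 1-D example (all assumptions): positions above 1.0 violate a constraint.
def toy_task_policy(state: State) -> Action:
    return 0.3  # always push toward higher positions for task reward


def toy_recovery_policy(state: State) -> Action:
    return -0.3  # back away from the constraint boundary


def toy_risk(state: State, action: Action) -> float:
    # Crude stand-in for a learned risk estimate: 1.0 if the next position
    # would cross the boundary, else 0.0.
    return 1.0 if state + action > 1.0 else 0.0


safe_choice = select_action(0.2, toy_task_policy, toy_recovery_policy, toy_risk)
risky_choice = select_action(0.9, toy_task_policy, toy_recovery_policy, toy_risk)
```

In this sketch the task policy never has to reason about safety; only the risk check decides which policy acts, which mirrors the separation of goals described in the abstract.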


Few-Shot Complex Knowledge Base Question Answering via Meta Reinforcement Learning

arXiv.org Artificial Intelligence

Complex question-answering (CQA) involves answering complex natural-language questions on a knowledge base (KB). However, the conventional neural program induction (NPI) approach exhibits uneven performance when the questions have different types, harboring inherently different characteristics, e.g., difficulty level. This paper proposes a meta-reinforcement learning approach to program induction in CQA to tackle the potential distributional bias in questions. Our method quickly and effectively adapts the meta-learned programmer to new questions based on the most similar questions retrieved from the training data. The meta-learned policy is then used to learn a good programming policy, utilizing the trial trajectories and their rewards for similar questions in the support set. Our method achieves state-of-the-art performance on the CQA dataset (Saha et al., 2018) while using only five trial trajectories for the top-5 retrieved questions in each support set, and meta-training on tasks constructed from only 1% of the training set. We have released our code at https://github.com/DevinJake/MRL-CQA.


Low-Variance Policy Gradient Estimation with World Models

arXiv.org Artificial Intelligence

In this paper, we propose World Model Policy Gradient (WMPG), an approach to reduce the variance of policy gradient estimates using learned world models (WMs). In WMPG, a WM is trained online and used to imagine trajectories. The imagined trajectories are used in two ways. First, they are used to calculate a without-replacement estimator of the policy gradient. Second, their return is used as an informed baseline. We compare the proposed approach with AC and MAC on a set of environments of increasing complexity (CartPole, LunarLander and Pong) and find that WMPG has better sample efficiency. Based on these results, we conclude that WMPG can yield increased sample efficiency in cases where a robust latent representation of the environment can be learned.
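How an informed baseline enters a policy gradient estimate can be sketched with a generic score-function (REINFORCE-style) estimator. This is not the WMPG algorithm itself and it omits the without-replacement estimator; the function names, the two-action softmax policy, and the numbers are assumptions for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits: List[float], action: int) -> List[float]:
    """Gradient of log softmax(logits)[action] with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def policy_gradient_sample(
    logits: List[float],
    action: int,
    real_return: float,
    imagined_returns: List[float],
) -> List[float]:
    """Single-sample estimate: grad log pi(a) * (return - baseline).

    The baseline is the mean return of trajectories imagined by a
    (hypothetical) world model.
    """
    baseline = sum(imagined_returns) / len(imagined_returns)
    advantage = real_return - baseline
    return [g * advantage for g in grad_log_prob(logits, action)]


# Uniform two-action policy, real return 1.0, imagined returns average 0.75.
estimate = policy_gradient_sample([0.0, 0.0], 0, 1.0, [0.5, 1.0])
```

A baseline that does not depend on the sampled action leaves the expected gradient unchanged, so subtracting it can reduce variance without adding bias.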


Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient

arXiv.org Artificial Intelligence

Off-policy Reinforcement Learning (RL) holds the promise of better data efficiency as it allows sample reuse and potentially enables safe interaction with the environment. Current off-policy policy gradient methods suffer from either high bias or high variance, often delivering unreliable estimates. The price of inefficiency becomes evident in real-world scenarios such as interaction-driven robot learning, where the success of RL has been rather limited, and a very high sample cost hinders straightforward application. In this paper, we propose a nonparametric Bellman equation, which can be solved in closed form. The solution is differentiable w.r.t. the policy parameters and gives access to an estimation of the policy gradient. In this way, we avoid the high variance of importance sampling approaches, and the high bias of semi-gradient methods. We empirically analyze the quality of our gradient estimate against state-of-the-art methods, and show that it outperforms the baselines in terms of sample efficiency on classical control tasks.
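For intuition about what a closed-form solution of a Bellman equation looks like, consider the standard finite-state case for a fixed policy. This is a textbook analogue, not the paper's nonparametric equation. Here $P_\pi$ is the state-transition matrix under policy $\pi$, $r_\pi$ the expected reward vector, and $\gamma \in [0, 1)$ the discount factor; these symbols are introduced here for illustration.

```latex
V_\pi = r_\pi + \gamma P_\pi V_\pi
\quad\Longrightarrow\quad
V_\pi = (I - \gamma P_\pi)^{-1} r_\pi
```

Because the right-hand side is a differentiable function of $P_\pi$ and $r_\pi$, which in turn depend on the policy, a gradient with respect to the policy parameters can in principle be obtained by differentiating through the solution.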


Data Science: Supervised Machine Learning in Python

#artificialintelligence

Udemy online course: Full Guide to Implementing Classic Machine Learning Algorithms in Python and with Sci-Kit Learn. Created by Lazy Programmer Inc. English [Auto-generated], Spanish [Auto-generated]. In recent years, we've seen a resurgence in AI, or artificial intelligence, and machine learning. Machine learning has led to some amazing results, like being able to analyze medical images and predict diseases on par with human experts. Google's AlphaGo program was able to beat a world champion in the strategy game Go using deep reinforcement learning. Machine learning is even being used to program self-driving cars, which is going to change the automotive industry forever. Imagine a world with drastically reduced car accidents, simply by removing the element of human error.


Estimating the Impact of Training Data with Reinforcement Learning

#artificialintelligence

Posted by Jinsung Yoon and Sercan O. Arik, Research Scientists, Cloud AI Team, Google Research. Recent work suggests that not all data sam...


What is Reinforcement Learning and how does it function?

#artificialintelligence

Reinforcement learning (RL) is a subset of machine learning (ML). It allows an agent to learn from the consequences of its actions in a specific environment. It can be used, for example, to teach a robot new tricks. It is a behavioral learning model in which the algorithm receives feedback on its actions, directing it toward the best outcome. It differs from supervised learning in that the machine is not trained on a sample data set. Instead, it learns by trial and error.
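Learning by trial and error can be illustrated with a small multi-armed bandit, one of the simplest RL settings. This is a minimal sketch under stated assumptions: the function name `run_bandit`, the three fixed rewards, and the epsilon-greedy settings are all hypothetical choices made for this example.

```python
import random
from typing import List


def run_bandit(rewards: List[float], steps: int = 200,
               epsilon: float = 0.1, seed: int = 0) -> List[float]:
    """Estimate each arm's value by trial and error (epsilon-greedy)."""
    rng = random.Random(seed)
    n_arms = len(rewards)
    estimates = [0.0] * n_arms
    counts = [0] * n_arms
    for t in range(steps):
        if t < n_arms:
            arm = t  # try every arm once before acting greedily
        elif rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = rewards[arm]  # the environment's feedback for this action
        counts[arm] += 1
        # Incremental average of the rewards observed for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates


# Hypothetical environment: three actions with fixed rewards.
estimates = run_bandit([0.1, 0.9, 0.5])
best_arm = max(range(len(estimates)), key=lambda a: estimates[a])
```

No labelled examples are supplied; the agent only sees the reward that follows each action it tries, and its value estimates come entirely from that feedback.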


UK Researchers Say AI Needs More Animal Sense

#artificialintelligence

The incomplete understanding of human brains and how to endow computers with common sense are among AI's most enduring challenges. New research from DeepMind London, Imperial College London and the University of Cambridge argues that common sense in humans is founded on a set of basic capacities that are also possessed by many other animals, and that animal cognition can therefore serve as inspiration for many AI tasks and curricula. In a paper published in the journal Trends in Cognitive Sciences this month, the researchers identify just how much AI research might benefit from the field of animal cognition. There is no universally accepted definition of "common sense." While much research has used language as a touchstone, the new paper temporarily sets language aside to focus on other common sense capacities found in non-human animals. They believe such capacities, pertaining to the understanding of everyday concepts such as objects, space, and causality, are also a baseline for humans, and this "foundational layer of common sense, which is a prerequisite for human-level intelligence" could provide something that's lacking in today's AI systems.


Provably Efficient Online Agnostic Learning in Markov Games

arXiv.org Machine Learning

We study online agnostic learning, a problem that arises in episodic multi-agent reinforcement learning where the actions of the opponents are unobservable. We show that in this challenging setting, achieving sublinear regret against the best response in hindsight is statistically hard. We then consider a weaker notion of regret, and present an algorithm that achieves after $K$ episodes a sublinear $\tilde{\mathcal{O}}(K^{3/4})$ regret. This is the first sublinear regret bound (to our knowledge) in the online agnostic setting. Importantly, our regret bound is independent of the size of the opponents' action spaces. As a result, even when the opponents' actions are fully observable, our regret bound improves upon existing analysis (e.g., (Xie et al., 2020)) by an exponential factor in the number of opponents.
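To see why a sublinear bound matters, divide the regret by the number of episodes. Writing $\mathrm{Regret}(K)$ for the cumulative regret after $K$ episodes under the weaker notion of regret (notation assumed here, not taken from the paper), the stated bound gives:

```latex
\frac{\mathrm{Regret}(K)}{K}
  = \tilde{\mathcal{O}}\!\left(\frac{K^{3/4}}{K}\right)
  = \tilde{\mathcal{O}}\!\left(K^{-1/4}\right)
  \longrightarrow 0 \quad \text{as } K \to \infty
```

So the average per-episode regret vanishes as the number of episodes grows; the tilde hides logarithmic factors, which do not change this limit.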


Understanding the Pathologies of Approximate Policy Evaluation when Combined with Greedification in Reinforcement Learning

arXiv.org Artificial Intelligence

Despite empirical success, the theory of reinforcement learning (RL) with value function approximation remains fundamentally incomplete. Prior work has identified a variety of pathological behaviours that arise in RL algorithms that combine approximate on-policy evaluation and greedification. One prominent example is policy oscillation, wherein an algorithm may cycle indefinitely between policies, rather than converging to a fixed point. What is not well understood, however, is the quality of the policies in the region of oscillation. In this paper we present simple examples illustrating that, in addition to policy oscillation and multiple fixed points, the same basic issue can lead to convergence to the worst possible policy for a given approximation. Such behaviours can arise when algorithms optimize evaluation accuracy weighted by the distribution of states that occur under the current policy, but greedify based on the value of states which are rare or nonexistent under this distribution. This means the values used for greedification are unreliable and can steer the policy in undesirable directions. Our observation that this can lead to the worst possible policy shows that in a general sense such algorithms are unreliable. The existence of such examples helps to narrow the kind of theoretical guarantees that are possible and the kind of algorithmic ideas that are likely to be helpful. We demonstrate analytically and experimentally that such pathological behaviours can impact a wide range of RL and dynamic programming algorithms; such behaviours can arise both with and without bootstrapping, and with linear function approximation as well as with more complex parameterized functions like neural networks.