AITopics

Abstract: Safety remains a central obstacle preventing widespread use of RL in the real world: learning new tasks in uncertain environments requires extensive exploration, but safety requires limiting exploration. We propose Recovery RL, an algorithm which navigates this tradeoff by (1) leveraging offline data to learn about constraint-violating zones before policy learning and (2) separating the goals of improving task performance and constraint satisfaction across two policies: a task policy that only optimizes the task reward and a recovery policy that guides the agent to safety when constraint violation is likely. We evaluate Recovery RL on 6 simulation domains, including two contact-rich manipulation tasks and an image-based navigation task, and an image-based obstacle avoidance task on a physical robot. We compare Recovery RL to 5 prior safe RL methods which jointly optimize for task performance and safety via constrained optimization or reward shaping and find that Recovery RL outperforms the next best prior method across all domains. Results suggest that Recovery RL trades off constraint violations and task successes 2 - 80 times more efficiently in simulation domains and 3 times more efficiently in physical experiments. Videos and supplementary material accompany the paper.

Figure 1: Recovery RL can safely learn policies for contact-rich tasks from high-dimensional image observations in simulation experiments.

For example, consider an agent tasked with learning to extract a carton of milk from a fridge.
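The split between a task policy and a recovery policy described in the abstract can be pictured as a simple action filter. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `select_action`, `risk_estimate`, and `eps_risk`, the 1-D toy setting, and the hand-written risk function are all hypothetical, and the abstract does not say how the likelihood of a constraint violation is estimated.

```python
from typing import Callable

# Hypothetical 1-D toy setting: the state is a position x, and any
# position at or beyond HAZARD_X counts as a constraint violation.
HAZARD_X = 1.0


def task_policy(x: float) -> float:
    """Toy task policy: always push toward larger x (ignores safety)."""
    return 0.3


def recovery_policy(x: float) -> float:
    """Toy recovery policy: back away from the hazard."""
    return -0.3


def risk_estimate(x: float, action: float) -> float:
    """Hand-written stand-in for a learned risk model (assumption)."""
    return 1.0 if x + action >= HAZARD_X else 0.0


def select_action(
    x: float,
    task: Callable[[float], float],
    recovery: Callable[[float], float],
    risk: Callable[[float, float], float],
    eps_risk: float,
) -> float:
    """Use the task action unless its estimated risk exceeds eps_risk."""
    proposed = task(x)
    if risk(x, proposed) > eps_risk:
        return recovery(x)
    return proposed


far_from_hazard = select_action(0.0, task_policy, recovery_policy, risk_estimate, 0.5)
near_hazard = select_action(0.8, task_policy, recovery_policy, risk_estimate, 0.5)
```

In this sketch the task policy never needs to know about the constraint: the filter overrides it only when the proposed action is judged too risky.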

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2010.1592

Country: North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report (0.70)

Industry: Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Few-Shot Complex Knowledge Base Question Answering via Meta Reinforcement Learning

Hua, Yuncheng, Li, Yuan-Fang, Haffari, Gholamreza, Qi, Guilin, Wu, Tongtong

Complex question-answering (CQA) involves answering complex natural-language questions on a knowledge base (KB). However, the conventional neural program induction (NPI) approach exhibits uneven performance when the questions have different types, harboring inherently different characteristics, e.g., difficulty level. This paper proposes a meta-reinforcement learning approach to program induction in CQA to tackle the potential distributional bias in questions. Our method quickly and effectively adapts the meta-learned programmer to new questions based on the most similar questions retrieved from the training data. The meta-learned policy is then used to learn a good programming policy, utilizing the trial trajectories and their rewards for similar questions in the support set. Our method achieves state-of-the-art performance on the CQA dataset (Saha et al., 2018) while using only five trial trajectories for the top-5 retrieved questions in each support set, and meta-training on tasks constructed from only 1% of the training set. We have released our code at https://github.com/DevinJake/MRL-CQA.

machine learning, natural language, reinforcement learning, (21 more...)

2010.15877

Country:

Europe > Austria (0.05)
Asia > India (0.05)
Europe > United Kingdom > Wales (0.04)
(5 more...)

Genre: Research Report (0.50)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Nauman, Michal, Hengst, Floris Den

Low-Variance Policy Gradient Estimation with World Models

In this paper, we propose World Model Policy Gradient (WMPG), an approach to reduce the variance of policy gradient estimates using learned world models (WMs). In WMPG, a WM is trained online and used to imagine trajectories. The imagined trajectories are used in two ways: first, to calculate a without-replacement estimator of the policy gradient; second, the return of the imagined trajectories is used as an informed baseline. We compare the proposed approach with AC and MAC on a set of environments of increasing complexity (CartPole, LunarLander and Pong) and find that WMPG has better sample efficiency. Based on these results, we conclude that WMPG can yield increased sample efficiency in cases where a robust latent representation of the environment can be learned.
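To illustrate why a baseline reduces the variance of a policy gradient estimate, here is a small sketch on a two-armed bandit. It is not WMPG: the constant `baseline` is a hypothetical stand-in for the informed baseline that the abstract derives from imagined trajectories, and the function name, reward means, and noise level are assumptions chosen for illustration.

```python
import numpy as np


def reinforce_grads(rng, theta, reward_means, n_samples, baseline=0.0):
    """Per-sample score-function gradient estimates for a 2-armed bandit.

    The policy picks arm 1 with probability sigmoid(theta).
    """
    p1 = 1.0 / (1.0 + np.exp(-theta))
    actions = (rng.random(n_samples) < p1).astype(float)
    rewards = np.where(actions == 1.0, reward_means[1], reward_means[0])
    rewards = rewards + rng.normal(0.0, 0.1, size=n_samples)
    # For a Bernoulli(sigmoid(theta)) policy, d/dtheta log pi(a) = a - p1.
    score = actions - p1
    # Subtracting a baseline leaves the mean unchanged but can shrink variance.
    return score * (rewards - baseline)


rng = np.random.default_rng(0)
means = np.array([10.0, 11.0])
plain = reinforce_grads(rng, 0.0, means, 10_000)
based = reinforce_grads(rng, 0.0, means, 10_000, baseline=10.5)

print("mean without baseline:", plain.mean(), "variance:", plain.var())
print("mean with baseline:   ", based.mean(), "variance:", based.var())
```

Both estimators target the same gradient (0.25 for these settings), but the one with the baseline does so with far lower variance.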

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2010.15622

Country:

Asia > Middle East > Jordan (0.04)
Europe > Netherlands > North Holland > Amsterdam (0.04)

Genre: Research Report (1.00)

Industry: Leisure & Entertainment (0.68)

Technology:

Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.83)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.70)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.68)

Tosatto, Samuele, Carvalho, João, Peters, Jan

Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient

Off-policy Reinforcement Learning (RL) holds the promise of better data efficiency as it allows sample reuse and potentially enables safe interaction with the environment. Current off-policy policy gradient methods suffer from either high bias or high variance, often delivering unreliable estimates. The price of inefficiency becomes evident in real-world scenarios such as interaction-driven robot learning, where the success of RL has been rather limited, and a very high sample cost hinders straightforward application. In this paper, we propose a nonparametric Bellman equation, which can be solved in closed form. The solution is differentiable w.r.t. the policy parameters and gives access to an estimate of the policy gradient. In this way, we avoid the high variance of importance sampling approaches and the high bias of semi-gradient methods. We empirically analyze the quality of our gradient estimate against state-of-the-art methods, and show that it outperforms the baselines in terms of sample efficiency on classical control tasks.
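As a rough illustration of what solving a Bellman equation in closed form means, the sketch below evaluates a fixed policy on a tiny tabular Markov chain by solving the linear system V = r + gamma * P V directly. This is only a tabular analogue and not the nonparametric formulation from the paper; the transition matrix `P`, reward vector `r`, and function name are hypothetical.

```python
import numpy as np


def evaluate_policy_closed_form(P, r, gamma):
    """Solve V = r + gamma * P @ V exactly as a linear system."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, r)


# Two-state chain under a fixed policy: state 1 is absorbing and pays 1 per step.
P = np.array([[0.9, 0.1],
              [0.0, 1.0]])
r = np.array([0.0, 1.0])
gamma = 0.9

V = evaluate_policy_closed_form(P, r, gamma)
print(V)
```

Because the solution is obtained by a linear solve rather than by iterating, it is an explicit function of `P` and `r`, which is the property that makes differentiation with respect to policy-dependent quantities possible in principle.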

artificial intelligence, machine learning, reinforcement learning, (13 more...)

2010.14771

Country:

Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)
North America > United States > Washington > King County > Seattle (0.04)
North America > United States > Washington > King County > Bellevue (0.04)
(8 more...)

Genre: Research Report > Promising Solution (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

#artificialintelligence Oct-28-2020, 22:55:15 GMT

Data Science: Supervised Machine Learning in Python

Online Courses Udemy - Full Guide to Implementing Classic Machine Learning Algorithms in Python and with Sci-Kit Learn. Created by Lazy Programmer Inc. Description: In recent years, we've seen a resurgence in AI, or artificial intelligence, and machine learning. Machine learning has led to some amazing results, like being able to analyze medical images and predict diseases on par with human experts. Google's AlphaGo program was able to beat a world champion in the strategy game Go using deep reinforcement learning. Machine learning is even being used to program self-driving cars, which is going to change the automotive industry forever. Imagine a world with drastically reduced car accidents, simply by removing the element of human error.

artificial intelligence, machine learning, reinforcement learning, (11 more...)

Genre: Instructional Material > Course Syllabus & Notes (0.74)

Industry:

Education > Educational Setting > Online (1.00)
Education > Educational Technology > Educational Software > Computer Based Training (0.57)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.79)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.75)
Information Technology > Enterprise Applications > Human Resources > Learning Management (0.57)
(3 more...)

#artificialintelligence Oct-28-2020, 19:38:47 GMT

Estimating the Impact of Training Data with Reinforcement Learning

Posted by Jinsung Yoon and Sercan O. Arik, Research Scientists, Cloud AI Team, Google Research Recent work suggests that not all data sam...

artificial intelligence, machine learning, reinforcement learning, (1 more...)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

#artificialintelligence Oct-28-2020, 11:01:02 GMT

What is Reinforcement Learning and how does it function?

Reinforcement learning (RL) is a subset of machine learning (ML). It allows an agent to learn from the consequences of its actions in a specific environment. It can be used, for example, to teach a robot new tricks. It is a behavioral learning model in which the algorithm receives feedback on its actions, directing the agent toward the best outcome. It differs from supervised learning in that the machine is not trained on a sample data set; instead, it learns by trial and error.
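Learning by trial and error can be made concrete with a very small example. The sketch below is an epsilon-greedy agent on a two-armed bandit, chosen here purely for illustration; the function name, payout probabilities, and parameter values are assumptions and do not come from the article.

```python
import numpy as np


def run_epsilon_greedy(arm_probs, steps=2000, epsilon=0.1, seed=0):
    """Estimate each arm's payout rate purely from observed rewards."""
    rng = np.random.default_rng(seed)
    n_arms = len(arm_probs)
    estimates = np.zeros(n_arms)
    counts = np.zeros(n_arms, dtype=int)
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))   # explore: try a random arm
        else:
            arm = int(np.argmax(estimates))   # exploit: use the best guess so far
        reward = float(rng.random() < arm_probs[arm])
        counts[arm] += 1
        # Incremental running average of the rewards seen for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


estimates, counts = run_epsilon_greedy([0.2, 0.8])
print("estimated payout rates:", estimates)
print("pulls per arm:", counts)
```

No labeled examples are given to the agent: it only sees the reward that follows each action, and its estimates improve as it tries both arms.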

artificial intelligence, machine learning, reinforcement learning, (11 more...)

Industry: Leisure & Entertainment > Games > Computer Games (0.33)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)

#artificialintelligence Oct-28-2020, 02:15:36 GMT

UK Researchers Say AI Needs More Animal Sense

The incomplete understanding of human brains and the question of how to endow computers with common sense are among AI's most enduring challenges. New research from DeepMind London, Imperial College London and the University of Cambridge argues that common sense in humans is founded on a set of basic capacities that are also possessed by many other animals, and that animal cognition can therefore serve as inspiration for many AI tasks and curricula. In a paper published this month in the journal Trends in Cognitive Sciences, the researchers identify just how much AI research might benefit from the field of animal cognition. There is no universally accepted definition of "common sense." While much research has used language as a touchstone, the new paper temporarily sets language aside to focus on other common sense capacities found in non-human animals. They believe such capacities, pertaining to the understanding of everyday concepts such as objects, space, and causality, are also a baseline for humans, and that this "foundational layer of common sense, which is a prerequisite for human-level intelligence" could provide something that's lacking in today's AI systems.

artificial intelligence, machine learning, reinforcement learning, (10 more...)

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.25)
Asia > China (0.07)

Genre: Research Report > New Finding (0.72)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.38)

arXiv.org Machine Learning Oct-28-2020

Provably Efficient Online Agnostic Learning in Markov Games

Tian, Yi, Wang, Yuanhao, Yu, Tiancheng, Sra, Suvrit

We study online agnostic learning, a problem that arises in episodic multi-agent reinforcement learning where the actions of the opponents are unobservable. We show that in this challenging setting, achieving sublinear regret against the best response in hindsight is statistically hard. We then consider a weaker notion of regret, and present an algorithm that achieves after $K$ episodes a sublinear $\tilde{\mathcal{O}}(K^{3/4})$ regret. This is the first sublinear regret bound (to our knowledge) in the online agnostic setting. Importantly, our regret bound is independent of the size of the opponents' action spaces. As a result, even when the opponents' actions are fully observable, our regret bound improves upon existing analysis (e.g., (Xie et al., 2020)) by an exponential factor in the number of opponents.

artificial intelligence, machine learning, reinforcement learning, (16 more...)

2010.1502

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)

Genre:

Workflow (0.46)
Research Report (0.40)
Instructional Material (0.35)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.88)

Young, Kenny, Sutton, Richard S.

Understanding the Pathologies of Approximate Policy Evaluation when Combined with Greedification in Reinforcement Learning

arXiv.org Artificial Intelligence Oct-28-2020

Despite empirical success, the theory of reinforcement learning (RL) with value function approximation remains fundamentally incomplete. Prior work has identified a variety of pathological behaviours that arise in RL algorithms that combine approximate on-policy evaluation and greedification. One prominent example is policy oscillation, wherein an algorithm may cycle indefinitely between policies, rather than converging to a fixed point. What is not well understood, however, is the quality of the policies in the region of oscillation. In this paper we present simple examples illustrating that, in addition to policy oscillation and multiple fixed points, the same basic issue can lead to convergence to the worst possible policy for a given approximation. Such behaviours can arise when algorithms optimize evaluation accuracy weighted by the distribution of states that occur under the current policy, but greedify based on the value of states which are rare or nonexistent under this distribution. This means the values used for greedification are unreliable and can steer the policy in undesirable directions. Our observation that this can lead to the worst possible policy shows that in a general sense such algorithms are unreliable. The existence of such examples helps to narrow the kind of theoretical guarantees that are possible and the kind of algorithmic ideas that are likely to be helpful. We demonstrate analytically and experimentally that such pathological behaviours can impact a wide range of RL and dynamic programming algorithms; such behaviours can arise both with and without bootstrapping, and with linear function approximation as well as with more complex parameterized functions like neural networks.

artificial intelligence, machine learning, reinforcement learning, (15 more...)

2010.15268

Country:

North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine > Diagnostic Medicine (0.42)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)