AITopics | Reinforcement Learning

Collaborating Authors

Reinforcement Learning

"Reinforcement learning is learning what to do – how to map situations to actions – so as to maximize a numerical reward signal. The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them."
– Sutton, Richard S. and Andrew G. Barto. Reinforcement Learning: An Introduction. (1.1). MIT Press, Cambridge, MA, 1998.

News Overviews Instructional Materials AI-Alerts Classics

Last Week in AI

#artificialintelligenceJun-24-2020, 12:11:35 GMT

Every week, Invector Labs publishes a newsletter that covers the most recent developments in AI research and technology. You can find this week's issue below. You can sign up for it below. Vision and language are two of the fundamental pillars of human cognition. Since we are babies, vision and language are intrinsically related as we try to describe objects or create visual representation of a verbal description.

large language model, machine learning, reinforcement learning, (10 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.35)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.35)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.35)

Add feedback

The NetHack Learning Environment

Küttler, Heinrich, Nardelli, Nantas, Miller, Alexander H., Raileanu, Roberta, Selvatici, Marco, Grefenstette, Edward, Rocktäschel, Tim

arXiv.org Artificial IntelligenceJun-24-2020

Progress in Reinforcement Learning (RL) algorithms goes hand-in-hand with the development of challenging environments that test the limits of current methods. While existing RL environments are either sufficiently complex or based on fast simulation, they are rarely both. Here, we present the NetHack Learning Environment (NLE), a scalable, procedurally generated, stochastic, rich, and challenging environment for RL research based on the popular single-player terminal-based roguelike game, NetHack. We argue that NetHack is sufficiently complex to drive long-term research on problems such as exploration, planning, skill acquisition, and language-conditioned RL, while dramatically reducing the computational resources required to gather a large amount of experience. We compare NLE and its task suite to existing alternatives, and discuss why it is an ideal medium for testing the robustness and systematic generalization of RL agents. We demonstrate empirical success for early stages of the game using a distributed Deep RL baseline and Random Network Distillation exploration, alongside qualitative analysis of various agents trained in the environment. NLE is open source at https://github.com/facebookresearch/nle.

machine learning, natural language, reinforcement learning, (18 more...)

arXiv.org Artificial Intelligence

2006.1376

Country:

Europe > Sweden > Skåne County > Malmö (0.04)
Asia > Japan > Honshū > Kantō > Kanagawa Prefecture (0.04)
South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
(2 more...)

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment > Games > Computer Games (1.00)
Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

Learning Reward Functions from Diverse Sources of Human Feedback: Optimally Integrating Demonstrations and Preferences

Bıyık, Erdem, Losey, Dylan P., Palan, Malayandi, Landolfi, Nicholas C., Shevchuk, Gleb, Sadigh, Dorsa

arXiv.org Artificial IntelligenceJun-24-2020

Reward functions are a common way to specify the objective of a robot. As designing reward functions can be extremely challenging, a more promising approach is to directly learn reward functions from human teachers. Importantly, humans provide data in a variety of forms: these include instructions (e.g., natural language), demonstrations, (e.g., kinesthetic guidance), and preferences (e.g., comparative rankings). Prior research has independently applied reward learning to each of these different data sources. However, there exist many domains where some of these information sources are not applicable or inefficient -- while multiple sources are complementary and expressive. Motivated by this general problem, we present a framework to integrate multiple sources of information, which are either passively or actively collected from human users. In particular, we present an algorithm that first utilizes user demonstrations to initialize a belief about the reward function, and then proactively probes the user with preference queries to zero-in on their true reward. This algorithm not only enables us combine multiple data sources, but it also informs the robot when it should leverage each type of information. Further, our approach accounts for the human's ability to provide data: yielding user-friendly preference queries which are also theoretically optimal. Our extensive simulated experiments and user studies on a Fetch mobile manipulator demonstrate the superiority and the usability of our integrated framework.

demonstration, machine learning, reinforcement learning, (19 more...)

arXiv.org Artificial Intelligence

2006.14091

Country:

North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
North America > United States > Illinois > Cook County > Chicago (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)

Genre:

Research Report > New Finding (1.00)
Questionnaire & Opinion Survey (1.00)

Industry:

Transportation (0.68)
Aerospace & Defense (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.46)
Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)
(2 more...)

Add feedback

Reducing Overestimation Bias by Increasing Representation Dissimilarity in Ensemble Based Deep Q-Learning

Sheikh, Hassam Ullah, Bölöni, Ladislau

arXiv.org Artificial IntelligenceJun-24-2020

The first deep RL algorithm, DQN, was limited by the overestimation bias of the learned Q-function. Subsequent algorithms proposed techniques to reduce this problem, without fully eliminating it. Recently, the Maxmin and Ensemble Q-learning algorithms used the different estimates provided by ensembles of learners to reduce the bias. Unfortunately, in many scenarios the learners converge to the same point in the parametric or representation space, falling back to the classic single neural network DQN. In this paper, we describe a regularization technique to increase the dissimilarity in the representation space in these algorithms. We propose and compare five regularization functions inspired from economics theory and consensus optimization. We show that the resulting approach significantly outperforms the Maxmin and Ensemble Q-learning algorithms as well as non-ensemble baselines.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

arXiv.org Artificial Intelligence

2006.13823

Country:

North America > United States > New York (0.04)
North America > United States > Florida > Hillsborough County > University (0.04)

Genre: Research Report > Experimental Study (0.46)

Industry: Leisure & Entertainment (0.98)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Quantifying Differences in Reward Functions

Gleave, Adam, Dennis, Michael, Legg, Shane, Russell, Stuart, Leike, Jan

arXiv.org Artificial IntelligenceJun-24-2020

For many tasks, the reward function is too complex to be specified procedurally, and must instead be learned from user data. Prior work has evaluated learned reward functions by examining rollouts from a policy optimized for the learned reward. However, this method cannot distinguish between the learned reward function failing to reflect user preferences, and the reinforcement learning algorithm failing to optimize the learned reward. Moreover, the rollout method is highly sensitive to details of the environment the learned reward is evaluated in, which often differ in the deployment environment. To address these problems, we introduce the Equivalent-Policy Invariant Comparison (EPIC) distance to quantify the difference between two reward functions directly, without training a policy. We prove EPIC is invariant on an equivalence class of reward functions that always induce the same optimal policy. Furthermore, we find EPIC can be precisely approximated and is more robust than baselines to the choice of visitation distribution. Finally, we find that the EPIC distance of learned reward functions to the ground-truth reward is predictive of the success of training a policy, even in different transition dynamics.

artificial intelligence, machine learning, reinforcement learning, (18 more...)

arXiv.org Artificial Intelligence

2006.139

Genre: Research Report (0.82)

Industry: Information Technology > Security & Privacy (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.67)

Add feedback

Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers

Eysenbach, Benjamin, Asawa, Swapnil, Chaudhari, Shreyas, Salakhutdinov, Ruslan, Levine, Sergey

arXiv.org Artificial IntelligenceJun-24-2020

We propose a simple, practical, and intuitive approach for domain adaptation in reinforcement learning. Our approach stems from the idea that the agent's experience in the source domain should look similar to its experience in the target domain. Building off of a probabilistic view of RL, we formally show that we can achieve this goal by compensating for the difference in dynamics by modifying the reward function. This modified reward function is simple to estimate by learning auxiliary classifiers that distinguish source-domain transitions from target-domain transitions. Intuitively, the modified reward function penalizes the agent for visiting states and taking actions in the source domain which are not possible in the target domain. Said another way, the agent is penalized for transitions that would indicate that the agent is interacting with the source domain, rather than the target domain. Our approach is applicable to domains with continuous states and actions and does not require learning an explicit model of the dynamics. On discrete and continuous control tasks, we illustrate the mechanics of our approach and demonstrate its scalability to high-dimensional tasks.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

arXiv.org Artificial Intelligence

2006.13916

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (1.00)

Industry: Transportation (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.93)

Add feedback

When Will Generative Adversarial Imitation Learning Algorithms Attain Global Convergence

Guan, Ziwei, Xu, Tengyu, Liang, Yingbin

arXiv.org Machine LearningJun-24-2020

Generative adversarial imitation learning (GAIL) is a popular inverse reinforcement learning approach for jointly optimizing policy and reward from expert trajectories. A primary question about GAIL is whether applying a certain policy gradient algorithm to GAIL attains a global minimizer (i.e., yields the expert policy), for which existing understanding is very limited. Such global convergence has been shown only for the linear (or linear-type) MDP and linear (or linearizable) reward. In this paper, we study GAIL under general MDP and for nonlinear reward function classes (as long as the objective function is strongly concave with respect to the reward parameter). We characterize the global convergence with a sublinear rate for a broad range of commonly used policy gradient algorithms, all of which are implemented in an alternating manner with stochastic gradient ascent for reward update, including projected policy gradient (PPG)-GAIL, Frank-Wolfe policy gradient (FWPG)-GAIL, trust region policy optimization (TRPO)-GAIL and natural policy gradient (NPG)-GAIL. This is the first systematic theoretical study of GAIL for global convergence.

algorithm, convergence, parameterization, (15 more...)

arXiv.org Machine Learning

2006.13506

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
North America > United States > Ohio (0.04)
North America > United States > Massachusetts (0.04)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.34)

Add feedback

Strictly Batch Imitation Learning by Energy-based Distribution Matching

Jarrett, Daniel, Bica, Ioana, van der Schaar, Mihaela

arXiv.org Machine LearningJun-24-2020

Consider learning a policy purely on the basis of demonstrated behavior---that is, with no access to reinforcement signals, no knowledge of transition dynamics, and no further interaction with the environment. This *strictly batch imitation learning* problem arises wherever live experimentation is costly, such as in healthcare. One solution is simply to retrofit existing algorithms for apprenticeship learning to work in the offline setting. But such an approach bargains heavily on model estimation or off-policy evaluation, and can be indirect and inefficient. We argue that a good solution should be able to explicitly parameterize a policy (i.e. respecting action conditionals), implicitly account for rollout dynamics (i.e. respecting state marginals), and---crucially---operate in an entirely offline fashion. To meet this challenge, we propose a novel technique by *energy-based distribution matching* (EDM): By identifying parameterizations of the (discriminative) model of a policy with the (generative) energy function for state distributions, EDM provides a simple and effective solution that equivalently minimizes a divergence between the occupancy measures of the demonstrator and the imitator. Through experiments with application to control tasks and healthcare settings, we illustrate consistent performance gains over existing algorithms for strictly batch imitation learning.

imitation, international conference, learning, (11 more...)

arXiv.org Machine Learning

2006.14154

Genre: Research Report (0.70)

Industry:

Health & Medicine (1.00)
Education (0.88)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Add feedback

Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism

Cheung, Wang Chi, Simchi-Levi, David, Zhu, Ruihao

arXiv.org Machine LearningJun-24-2020

We consider un-discounted reinforcement learning (RL) in Markov decision processes (MDPs) under drifting non-stationarity, i.e., both the reward and state transition distributions are allowed to evolve over time, as long as their respective total variations, quantified by suitable metrics, do not exceed certain variation budgets. We first develop the Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence Widening (SWUCRL2-CW) algorithm, and establish its dynamic regret bound when the variation budgets are known. In addition, we propose the Bandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune the SWUCRL2-CW algorithm to achieve the same dynamic regret bound, but in a parameter-free manner, i.e., without knowing the variation budgets. Notably, learning non-stationary MDPs via the conventional optimistic exploration technique presents a unique challenge absent in existing (non-stationary) bandit learning settings. We overcome the challenge by a novel confidence widening technique that incorporates additional optimism.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Machine Learning

2006.14389

Country:

Europe > Austria > Vienna (0.14)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > Canada > Quebec > Montreal (0.04)
(2 more...)

Genre: Research Report (0.50)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.85)

Add feedback

Inverse Active Sensing: Modeling and Understanding Timely Decision-Making

Jarrett, Daniel, van der Schaar, Mihaela

arXiv.org Machine LearningJun-24-2020

Evidence-based decision-making entails collecting (costly) observations about an underlying phenomenon of interest, and subsequently committing to an (informed) decision on the basis of accumulated evidence. In this setting, active sensing is the goal-oriented problem of efficiently selecting which acquisitions to make, and when and what decision to settle on. As its complement, inverse active sensing seeks to uncover an agent's preferences and strategy given their observable decision-making behavior. In this paper, we develop an expressive, unified framework for the general setting of evidence-based decision-making under endogenous, context-dependent time pressure---which requires negotiating (subjective) tradeoffs between accuracy, speediness, and cost of information. Using this language, we demonstrate how it enables modeling intuitive notions of surprise, suspense, and optimality in decision strategies (the forward problem). Finally, we illustrate how this formulation enables understanding decision-making behavior by quantifying preferences implicit in observed decision strategies (the inverse problem).

artificial intelligence, machine learning, reinforcement learning, (19 more...)

arXiv.org Machine Learning

2006.14141

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Europe > Austria > Vienna (0.14)
North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
(2 more...)

Genre: Research Report (0.81)

Industry:

Health & Medicine > Diagnostic Medicine (1.00)
Health & Medicine > Therapeutic Area > Oncology (0.92)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.93)
(2 more...)

Add feedback