 Reinforcement Learning


An Introduction to Reinforcement Learning - Lex Fridman, MIT

#artificialintelligence

We were delighted to be joined by Lex Fridman at the San Francisco edition of the Deep Learning Summit, where he took part in both a 'Deep Dive' session, which allowed for a great deal of attendee interaction and collaboration, and a fireside chat with OpenAI Co-Founder & Chief Scientist Ilya Sutskever. The MIT researcher shared his thoughts on recent developments in AI and its current standing, highlighting its growth in recent years. Lex then referenced Lee Sedol, the South Korean 9th-dan Go player, who at this time is the only human to have won a game against the Go-playing AI AlphaGo, a feat that has since become something of an impossible task. He described this as a seminal moment, one which changed the course of not only deep learning but also reinforcement learning, increasing public belief in this subfield of AI. Since then, of course, we have seen video games and tactical games, including StarCraft, become central to the development of AI. The comparison of reinforcement learning to human learning is one we often come across, and Lex referenced it as something which needed addressing: humans seemingly learn from "very few examples", as opposed to the heavy data sets needed in AI, but why is that?


D4RL: Datasets for Deep Data-Driven Reinforcement Learning

arXiv.org Machine Learning

The offline reinforcement learning (RL) problem, also known as batch RL, refers to the setting where a policy must be learned from a static dataset, without additional online data collection. This setting is compelling as potentially it allows RL methods to take advantage of large, pre-collected datasets, much like how the rise of large datasets has fueled results in supervised learning in recent years. However, existing online RL benchmarks are not tailored towards the offline setting, making progress in offline RL difficult to measure. In this work, we introduce benchmarks specifically designed for the offline setting, guided by key properties of datasets relevant to real-world applications of offline RL. Examples of such properties include: datasets generated via hand-designed controllers and human demonstrators, multi-objective datasets where an agent can perform different tasks in the same environment, and datasets consisting of mixtures of policies. To facilitate research, we release our benchmark tasks and datasets with a comprehensive evaluation of existing algorithms and an evaluation protocol together with an open-source codebase. We hope that our benchmark will focus research effort on methods that drive improvements not just on simulated tasks, but ultimately on the kinds of real-world problems where offline RL will have the largest impact.


Intel's New AI System Can Optimise Reinforcement Learning Training On A Single System

#artificialintelligence

Sample Factory is a high-throughput training system optimised for a single-machine setting that combines a highly efficient, asynchronous, GPU-based …


Explore Fundamental Concepts of Reinforcement Learning

#artificialintelligence

We have seen that rewards (negative rewards are sometimes called penalties, but it's preferable to use a standardized notation) are the only feedback provided by the environment after each action. However, there are two different approaches to the use of rewards. The first one is the strategy of a very short-sighted agent and consists of taking into account only the reward just received. The main problem with this approach is clearly the inability to consider longer sequences that can lead to a very high reward. For example, an agent may have to traverse a few states with a negative reward (for example, -0.1) before arriving at a state with a very positive reward (for example, 5.0).
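The excerpt stops before describing the second approach. A common far-sighted alternative, assumed here rather than taken from the article, is to score a path by its discounted return. The minimal Python sketch below compares the two views on the example rewards from the text; the function names, the discount factor of 0.9, and the "stay put" path are illustrative assumptions.

```python
def immediate_reward(rewards):
    """Short-sighted view: judge a path only by the reward just received."""
    return rewards[0]


def discounted_return(rewards, gamma=0.9):
    """Far-sighted view: sum of rewards, each discounted by gamma per step.

    gamma=0.9 is an assumed value chosen for illustration.
    """
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (gamma ** step) * reward
    return total


# Path from the text: a few states with reward -0.1, then a state with 5.0.
costly_path = [-0.1, -0.1, -0.1, 5.0]
# Hypothetical alternative: stay put and collect nothing.
stay_put = [0.0, 0.0, 0.0, 0.0]

print("immediate:", immediate_reward(costly_path), immediate_reward(stay_put))
print("discounted:", discounted_return(costly_path), discounted_return(stay_put))
```

Under these assumptions the short-sighted agent prefers to stay put, because the first reward on the costly path is negative, while the discounted return ranks the costly path higher.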


Facebook releases AI development tool based on NetHack

#artificialintelligence

Facebook researchers believe the game NetHack is well-tailored to training, testing, and evaluating AI models. To this end, they today released the NetHack Learning Environment, a research tool for benchmarking the robustness and generalization of reinforcement learning agents. For decades, games have served as benchmarks for AI. But things really kicked into gear in 2013 -- the year Google subsidiary DeepMind demonstrated an AI system that could play Pong, Breakout, Space Invaders, Seaquest, Beamrider, Enduro, and Q*bert at superhuman levels. Rather, they're informing the development of systems that might one day diagnose illnesses, predict complicated protein structures, and segment CT scans.


A Closer Look at Invalid Action Masking in Policy Gradient Algorithms

arXiv.org Artificial Intelligence

In recent years, Deep Reinforcement Learning (DRL) algorithms have achieved state-of-the-art performance in many challenging strategy games. Because these games have complicated rules, an action sampled from the full discrete action space will typically be invalid. The usual approach to deal with this problem in policy gradient algorithms is to "mask out" invalid actions and just sample from the set of valid actions. The implications of this process, however, remain under-investigated. In this paper, we show that the standard working mechanism of invalid action masking corresponds to valid policy gradient updates. More interestingly, it works by applying a state-dependent differentiable function during the calculation of action probability distribution. Additionally, we show its critical importance to the performance of policy gradient algorithms. Specifically, our experiments show that invalid action masking scales well when the space of invalid actions is large, while the common approach of giving negative rewards for invalid actions will fail. Finally, we provide further insights by evaluating different action masking regimes, such as removing masking after an agent has been trained using masking.
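As a rough illustration of "masking out" invalid actions, the sketch below replaces the logits of invalid actions with a large negative number before the softmax, so that those actions receive zero probability. This is one common way to implement masking and is assumed here, not quoted from the paper; the function name, the mask value, and the example logits are hypothetical.

```python
import math


def masked_softmax(logits, valid_mask, mask_value=-1e8):
    """Return action probabilities with invalid actions masked out.

    Assumes at least one action is valid. mask_value is an assumed
    large negative constant that drives masked probabilities to zero.
    """
    masked = [logit if ok else mask_value
              for logit, ok in zip(logits, valid_mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(m - peak) for m in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical state with four actions, two of which are invalid.
logits = [1.0, 2.0, 0.5, 3.0]
valid = [True, False, True, False]
probs = masked_softmax(logits, valid)
print(probs)
```

Because the mask depends only on the state, the masked distribution remains a differentiable function of the unmasked logits for the valid actions, which matches the abstract's description of a state-dependent differentiable function.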


What can I do here? A Theory of Affordances in Reinforcement Learning

arXiv.org Artificial Intelligence

Reinforcement learning algorithms usually assume that all actions are always available to an agent. However, both people and animals understand the general link between the features of their environment and the actions that are feasible. Gibson (1977) coined the term "affordances" to describe the fact that certain states enable an agent to do certain actions, in the context of embodied agents. In this paper, we develop a theory of affordances for agents who learn and plan in Markov Decision Processes. Affordances play a dual role in this case. On one hand, they allow faster planning, by reducing the number of actions available in any given situation. On the other hand, they facilitate more efficient and precise learning of transition models from data, especially when such models require function approximation. We establish these properties through theoretical results as well as illustrative examples. We also propose an approach to learn affordances and use it to estimate transition models that are simpler and generalize better.
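To make the planning benefit concrete, here is a toy sketch, not the paper's method, of value iteration that only evaluates the actions afforded in each state. The corridor environment, the `affordances` table, and all names are hypothetical and chosen only for illustration.

```python
def value_iteration(states, affordances, step, gamma=0.9, sweeps=50):
    """Value iteration restricted to the actions afforded in each state.

    affordances[s] lists the actions considered in state s;
    step(s, a) returns (next_state, reward) for a deterministic toy model.
    """
    values = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            candidates = []
            for a in affordances[s]:
                nxt, reward = step(s, a)
                candidates.append(reward + gamma * values[nxt])
            # States with no afforded actions (terminal) keep value 0.
            values[s] = max(candidates) if candidates else 0.0
    return values


def corridor_step(state, action):
    """Hypothetical 4-cell corridor; entering cell 3 yields reward 1."""
    nxt = min(state + 1, 3) if action == "right" else max(state - 1, 0)
    reward = 1.0 if (nxt == 3 and state != 3) else 0.0
    return nxt, reward


states = [0, 1, 2, 3]
# "left" is not afforded at the wall (cell 0); cell 3 is terminal.
affordances = {0: ["right"], 1: ["left", "right"], 2: ["left", "right"], 3: []}
values = value_iteration(states, affordances, corridor_step)
print(values)
```

In this toy setting the planner never evaluates "left" at the wall or any action in the terminal cell, which is the sense in which a smaller afforded action set reduces planning work.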


Perception-Prediction-Reaction Agents for Deep Reinforcement Learning

arXiv.org Artificial Intelligence

We introduce a new recurrent agent architecture and associated auxiliary losses which improve reinforcement learning in partially observable tasks requiring long-term memory. We employ a temporal hierarchy, using a slow-ticking recurrent core to allow information to flow more easily over long time spans, and three fast-ticking recurrent cores with connections designed to create an information asymmetry. The reaction core incorporates new observations with input from the slow core to produce the agent's policy; the perception core accesses only short-term observations and informs the slow core; lastly, the prediction core accesses only long-term memory. An auxiliary loss regularizes policies drawn from all three cores against each other, enacting the prior that the policy should be expressible from either recent or long-term memory. We present the resulting Perception-Prediction-Reaction (PPR) agent and demonstrate its improved performance over a strong LSTM-agent baseline in DMLab-30, particularly in tasks requiring long-term memory. We further show significant improvements in Capture the Flag, an environment requiring agents to acquire a complicated mixture of skills over long time scales. In a series of ablation experiments, we probe the importance of each component of the PPR agent, establishing that the entire, novel combination is necessary for this intriguing result.


DDPG++: Striving for Simplicity in Continuous-control Off-Policy Reinforcement Learning

arXiv.org Machine Learning

This paper prescribes a suite of techniques for off-policy Reinforcement Learning (RL) that simplify the training process and reduce the sample complexity. First, we show that simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled. This is in contrast to existing literature which creates sophisticated off-policy techniques. Second, we pinpoint training instabilities, typical of off-policy algorithms, to the greedy policy update step; existing solutions such as delayed policy updates do not mitigate this issue. Third, we show that ideas in the propensity estimation literature can be used to importance-sample transitions from the replay buffer and selectively update the policy to prevent deterioration of performance. We make these claims using extensive experimentation on a set of challenging MuJoCo tasks. A short video of our results can be seen at https://tinyurl.com/scs6p5m .


Policy-GNN: Aggregation Optimization for Graph Neural Networks

arXiv.org Machine Learning

Graph data are pervasive in many real-world applications. Recently, increasing attention has been paid to graph neural networks (GNNs), which aim to model the local graph structures and capture the hierarchical patterns by aggregating the information from neighbors with stackable network modules. Motivated by the observation that different nodes often require different iterations of aggregation to fully capture the structural information, in this paper, we propose to explicitly sample diverse iterations of aggregation for different nodes to boost the performance of GNNs. It is a challenging task to develop an effective aggregation strategy for each node, given complex graphs and sparse features. Moreover, it is not straightforward to derive an efficient algorithm since we need to feed the sampled nodes into different numbers of network layers. To address the above challenges, we propose Policy-GNN, a meta-policy framework that models the sampling procedure and message passing of GNNs into a combined learning process. Specifically, Policy-GNN uses a meta-policy to adaptively determine the number of aggregations for each node. The meta-policy is trained with deep reinforcement learning (RL) by exploiting the feedback from the model. We further introduce parameter sharing and a buffer mechanism to boost the training efficiency. Experimental results on three real-world benchmark datasets suggest that Policy-GNN significantly outperforms the state-of-the-art alternatives, showing the promise in aggregation optimization for GNNs.