 Reinforcement Learning


On Polynomial Time PAC Reinforcement Learning with Rich Observations

arXiv.org Machine Learning

We study episodic reinforcement learning (RL) when the observations may be realistically rich, such as images or text. We aim for methods that use function approximation in a provably effective manner to find the best possible policy through systematic exploration. While such problems are central to empirical RL research [22], most theoretical results on systematic exploration have focused on tabular MDPs with small state spaces [e.g., 19]. Until recently, little was known about how to engage in sophisticated exploration in the general function approximation setting to achieve global optimality in a statistically efficient manner. Indeed, as pointed out by Krishnamurthy et al. [20], no algorithm achieving polynomial sample complexity is possible without further assumptions. Nevertheless, when the underlying problem exhibits additional structure, it was recently shown that learning becomes statistically feasible. In particular, Krishnamurthy et al. [20] showed that reactive POMDPs with rich observations and deterministic dynamics over M hidden states can be learned with polynomial sample complexity that depends on M. Later, Jiang et al. [16] provided a new algorithm called OLIVE that ... In this paper, we directly address this difficult computational challenge. We adopt a reduction approach, meaning that we aim to design algorithms whose computation can be reduced to common optimization oracles over function spaces, such as linear optimization and cost-sensitive classification, while retaining the statistical properties of prior works.


Inverse Reinforcement Learning via Nonparametric Spatio-Temporal Subgoal Modeling

arXiv.org Machine Learning

Recent advances in the field of inverse reinforcement learning (IRL) have yielded sophisticated frameworks which relax the original modeling assumption that the behavior of an observed agent reflects only a single intention. Instead, the demonstration data is typically divided into parts, to account for the fact that different trajectories may correspond to different intentions, e.g., because they were generated by different domain experts. In this work, we go one step further: using the intuitive concept of subgoals, we build upon the premise that even a single trajectory can be explained more efficiently locally within a certain context than globally, enabling a more compact representation of the observed behavior. Based on this assumption, we build an implicit intentional model of the agent's goals to forecast its behavior in unobserved situations. The result is an integrated Bayesian prediction framework which provides smooth policy estimates that are consistent with the expert's plan and significantly outperform existing IRL solutions. Most notably, our framework naturally handles situations where the intentions of the agent change with time and classical IRL algorithms fail. In addition, due to its probabilistic nature, the model can be straightforwardly applied in an active learning setting to guide the demonstration process of the expert.


Ingredients for Robotics Research

#artificialintelligence

This release includes four environments using the Fetch research platform and four environments using the ShadowHand robot. The manipulation tasks contained in these environments are significantly more difficult than the MuJoCo continuous control environments currently available in Gym, all of which are now easily solvable using recently released algorithms like PPO. Furthermore, our newly released environments use models of real robots and require the agent to solve realistic tasks. FetchReach-v0: Fetch has to move its end-effector to the desired goal position. FetchSlide-v0: Fetch has to hit a puck across a long table such that it slides and comes to rest on the desired goal.


Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning

arXiv.org Machine Learning

Recent model-free reinforcement learning algorithms have proposed incorporating learned dynamics models as a source of additional data with the intention of reducing sample complexity. Such methods hold the promise of incorporating imagined data coupled with a notion of model uncertainty to accelerate the learning of continuous control tasks. Unfortunately, they rely on heuristics that limit usage of the dynamics model. We present model-based value expansion, which controls for uncertainty in the model by only allowing imagination to a fixed depth. By enabling wider use of learned dynamics models within a model-free reinforcement learning algorithm, we improve value estimation, which, in turn, reduces the sample complexity of learning.
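
The fixed-depth imagination described in this abstract can be illustrated with a small sketch. It rolls a dynamics model forward for a fixed number of steps, sums the discounted imagined rewards, and bootstraps from a value estimate at the final imagined state. The function name and all of its arguments (`policy`, `model`, `reward_fn`, `value_fn`, `horizon`, `gamma`) are hypothetical names chosen for illustration, and the sketch assumes a deterministic model; it is not the authors' implementation and may omit details of their method.

```python
def value_expansion_target(state, policy, model, reward_fn, value_fn,
                           horizon, gamma):
    """Fixed-depth rollout target (illustrative sketch only).

    Assumed, hypothetical interfaces:
      policy(state) -> action
      model(state, action) -> next_state   (a learned, deterministic model)
      reward_fn(state, action, next_state) -> float
      value_fn(state) -> float             (a learned value estimate)
    """
    total = 0.0
    discount = 1.0
    for _ in range(horizon):
        action = policy(state)
        next_state = model(state, action)
        # Accumulate the imagined reward, discounted by its depth.
        total += discount * reward_fn(state, action, next_state)
        discount *= gamma
        state = next_state
    # Bootstrap from the value estimate at the last imagined state.
    return total + discount * value_fn(state)


# Toy usage: states are integers, every step yields reward 1.
target = value_expansion_target(
    state=0,
    policy=lambda s: 0,
    model=lambda s, a: s + 1,
    reward_fn=lambda s, a, s2: 1.0,
    value_fn=lambda s: 10.0,
    horizon=3,
    gamma=0.5,
)
print(target)
```

With `horizon=0` the function falls back to the plain value estimate, which makes the depth parameter an explicit dial for how much the target relies on the model.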


Learning by Playing - Solving Sparse Reward Tasks from Scratch

arXiv.org Machine Learning

We propose Scheduled Auxiliary Control (SAC-X), a new learning paradigm in the context of Reinforcement Learning (RL). SAC-X enables learning of complex behaviors - from scratch - in the presence of multiple sparse reward signals. To this end, the agent is equipped with a set of general auxiliary tasks that it attempts to learn simultaneously via off-policy RL. The key idea behind our method is that active (learned) scheduling and execution of auxiliary policies allows the agent to efficiently explore its environment - enabling it to excel at sparse reward RL. Our experiments in several challenging robotic manipulation settings demonstrate the power of our approach. A video of the rich set of learned behaviours can be found at https://youtu.be/mPKyvocNe_M.


Deep Reinforcement Learning for Vision-Based Robotic Grasping: A Simulated Comparative Evaluation of Off-Policy Methods

arXiv.org Machine Learning

In this paper, we explore deep reinforcement learning algorithms for vision-based robotic grasping. Model-free deep reinforcement learning (RL) has been successfully applied to a range of challenging environments, but the proliferation of algorithms makes it difficult to discern which particular approach would be best suited for a rich, diverse task like grasping. To answer this question, we propose a simulated benchmark for robotic grasping that emphasizes off-policy learning and generalization to unseen objects. Off-policy learning enables utilization of grasping data over a wide variety of objects, and diversity is important to enable the method to generalize to new objects that were not seen during training. We evaluate the benchmark tasks against a variety of Q-function estimation methods, a method previously proposed for robotic grasping with deep neural network models, and a novel approach based on a combination of Monte Carlo return estimation and an off-policy correction. Our results indicate that several simple methods provide a surprisingly strong competitor to popular algorithms such as double Q-learning, and our analysis of stability sheds light on the relative tradeoffs between the algorithms.
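
As background for the Monte Carlo return estimation mentioned in this abstract, the following is a minimal sketch of how discounted returns are computed from one episode's rewards. The function name `discounted_returns` and the discount value are assumptions made for illustration. The sketch covers only the return computation, not the off-policy correction or any other part of the authors' method.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Toy usage: a three-step episode with reward 1 at every step.
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```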


An intro to Reinforcement Learning (with otters) – Monica Dinculescu

@machinelearnbot

Before I wrote the JavaScripts, I got a master's in AI (almost a decade ago), and wrote a thesis on a weird and new area in Reinforcement Learning. Or at least it was new then. With all the hype around Machine Learning and Deep Learning, I thought it would be neat if I wrote a little primer on what Reinforcement Learning really means, and why it's different than just another neural net. Richard Sutton and Andrew Barto wrote an amazing book called "Reinforcement Learning: an introduction"; it's my favourite non-fiction book I have ever read in my life, and it's why I fell in love with RL. The complete draft is available for free here, and if you're into math, and want to explore this topic further, I can't recommend it enough.


The Policy of Truth

#artificialintelligence

This is the sixth part of "An Outsider's Tour of Reinforcement Learning." Our first generic candidate for solving reinforcement learning is Policy Gradient. I find it shocking that Policy Gradient wasn't ruled out as a bad idea in 1993. Policy gradient is seductive as it apparently lets one fine tune a program to solve any problem without any domain knowledge. Of course, anything that makes such a claim must be too general for its own good.
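
For readers unfamiliar with the method under discussion, here is a minimal sketch of a policy gradient update in its score-function (REINFORCE) form, applied to a toy multi-armed bandit with a softmax policy. The bandit, the function names, the learning rate, and the step count are all assumptions chosen for illustration; none of them come from the post.

```python
import math
import random


def softmax(prefs):
    """Convert preferences into action probabilities (numerically stable)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_rewards, steps=200, lr=0.1, seed=0):
    """Run REINFORCE on a bandit whose arms pay fixed, known rewards."""
    rng = random.Random(seed)
    prefs = [0.0] * len(arm_rewards)
    for _ in range(steps):
        probs = softmax(prefs)
        # Sample an action from the current policy.
        action = rng.choices(range(len(prefs)), weights=probs)[0]
        reward = arm_rewards[action]
        for i in range(len(prefs)):
            # Gradient of log pi(action) with respect to preference i.
            grad_log = (1.0 if i == action else 0.0) - probs[i]
            prefs[i] += lr * reward * grad_log
    return softmax(prefs)


# Toy usage: arm 1 pays 1, arm 0 pays nothing.
probs = reinforce_bandit([0.0, 1.0])
print(probs)
```

The update needs only sampled actions and their rewards, with no model of the problem, which is the "no domain knowledge" property the post describes.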



DiGrad: Multi-Task Reinforcement Learning with Shared Actions

arXiv.org Machine Learning

Most reinforcement learning algorithms are inefficient for learning multiple tasks in complex robotic systems, where different tasks share a set of actions. In such environments, a compound policy that performs multiple tasks concurrently may be learnt with shared neural network parameters. However, such a compound policy may become biased towards one task, or the gradients from different tasks may negate each other, making the learning unstable and sometimes less data efficient. In this paper, we propose a new approach for simultaneous training of multiple tasks sharing a set of common actions in continuous action spaces, which we call DiGrad (Differential Policy Gradient). The proposed framework is based on differential policy gradients and can accommodate multi-task learning in a single actor-critic network. We also propose a simple heuristic in the differential policy gradient update to further improve the learning. The proposed architecture was tested on an 8-link planar manipulator and a 27 degrees of freedom (DoF) humanoid for learning multi-goal reachability tasks for 3 and 2 end effectors, respectively. We show that our approach supports efficient multi-task learning in complex robotic systems, outperforming related methods in continuous action spaces.