Goto

Collaborating Authors

 Reinforcement Learning


PlanGAN: Model-based Planning With Sparse Rewards and Multiple Goals

arXiv.org Artificial Intelligence

Learning with sparse rewards remains a significant challenge in reinforcement learning (RL), especially when the aim is to train a policy capable of achieving multiple different goals. To date, the most successful approaches for dealing with multi-goal, sparse reward environments have been model-free RL algorithms. In this work we propose PlanGAN, a model-based algorithm specifically designed for solving multi-goal tasks in environments with sparse rewards. Our method builds on the fact that any trajectory of experience collected by an agent contains useful information about how to achieve the goals observed during that trajectory. We use this to train an ensemble of conditional generative models (GANs) to generate plausible trajectories that lead the agent from its current state towards a specified goal. We then combine these imagined trajectories into a novel planning algorithm in order to achieve the desired goal as efficiently as possible. The performance of PlanGAN has been tested on a number of robotic navigation/manipulation tasks in comparison with a range of model-free reinforcement learning baselines, including Hindsight Experience Replay. Our studies indicate that PlanGAN can achieve comparable performance whilst being around 4-8 times more sample efficient.


Probing Emergent Semantics in Predictive Agents via Question Answering

arXiv.org Artificial Intelligence

Recent work has shown how predictive modeling can endow agents with rich knowledge of their surroundings, improving their ability to act in complex environments. We propose question-answering as a general paradigm to decode and understand the representations that such agents develop, applying our method to two recent approaches to predictive modeling -action-conditional CPC (Guo et al., 2018) and SimCore (Gregor et al., 2019). After training agents with these predictive objectives in a visually-rich, 3D environment with an assortment of objects, colors, shapes, and spatial configurations, we probe their internal state representations with synthetic (English) questions, without backpropagating gradients from the question-answering decoder into the agent. The performance of different agents when probed this way reveals that they learn to encode factual, and seemingly compositional, information about objects, properties and spatial relations from their physical environment. Our approach is intuitive, i.e. humans can easily interpret responses of the model as opposed to inspecting continuous vectors, and model-agnostic, i.e. applicable to any modeling approach. By revealing the implicit knowledge of objects, quantities, properties and relations acquired by agents as they learn, question-conditional agent probing can stimulate the design and development of stronger predictive learning objectives.


Acme: A Research Framework for Distributed Reinforcement Learning

arXiv.org Artificial Intelligence

Deep reinforcement learning has led to many recent-and groundbreaking-advancements. However, these advances have often come at the cost of both the scale and complexity of the underlying RL algorithms. Increases in complexity have in turn made it more difficult for researchers to reproduce published RL algorithms or rapidly prototype ideas. To address this, we introduce Acme, a tool to simplify the development of novel RL algorithms that is specifically designed to enable simple agent implementations that can be run at various scales of execution. Our aim is also to make the results of various RL algorithms developed in academia and industrial labs easier to reproduce and extend. To this end we are releasing baseline implementations of various algorithms, created using our framework. In this work we introduce the major design decisions behind Acme and show how these are used to construct these baselines. We also experiment with these agents at different scales of both complexity and computation-including distributed versions. Ultimately, we show that the design decisions behind Acme lead to agents that can be scaled both up and down and that, for the most part, greater levels of parallelization result in agents with equivalent performance, just faster.


Temporal-Differential Learning in Continuous Environments

arXiv.org Artificial Intelligence

In this paper, a new reinforcement learning (RL) method known as the method of temporal differential is introduced. Compared to the traditional temporal-difference learning method, it plays a crucial role in developing novel RL techniques for continuous environments. In particular, the continuous-time least squares policy evaluation (CT-LSPE) and the continuous-time temporal-differential (CT-TD) learning methods are developed. Both theoretical and empirical evidences are provided to demonstrate the effectiveness of the proposed temporal-differential learning methodology.


Model-Based Reinforcement Learning with Value-Targeted Regression

arXiv.org Machine Learning

This paper studies model-based reinforcement learning (RL) for regret minimization. We focus on finite-horizon episodic RL where the transition model $P$ belongs to a known family of models $\mathcal{P}$, a special case of which is when models in $\mathcal{P}$ take the form of linear mixtures: $P_{\theta} = \sum_{i=1}^{d} \theta_{i}P_{i}$. We propose a model based RL algorithm that is based on optimism principle: In each episode, the set of models that are `consistent' with the data collected is constructed. The criterion of consistency is based on the total squared error of that the model incurs on the task of predicting \emph{values} as determined by the last value estimate along the transitions. The next value function is then chosen by solving the optimistic planning problem with the constructed set of models. We derive a bound on the regret, which, in the special case of linear mixtures, the regret bound takes the form $\tilde{\mathcal{O}}(d\sqrt{H^{3}T})$, where $H$, $T$ and $d$ are the horizon, total number of steps and dimension of $\theta$, respectively. In particular, this regret bound is independent of the total number of states or actions, and is close to a lower bound $\Omega(\sqrt{HdT})$. For a general model family $\mathcal{P}$, the regret bound is derived using the notion of the so-called Eluder dimension proposed by Russo & Van Roy (2014).


Reinforcement learning and Bayesian data assimilation for model-informed precision dosing in oncology

arXiv.org Machine Learning

Model-informed precision dosing (MIPD) using therapeutic drug/biomarker monitoring offers the opportunity to significantly improve the efficacy and safety of drug therapies. Current strategies comprise model-informed dosing tables or are based on maximum a-posteriori estimates. These approaches, however, lack a quantification of uncertainty and/or consider only part of the available patient-specific information. We propose three novel approaches for MIPD employing Bayesian data assimilation (DA) and/or reinforcement learning (RL) to control neutropenia, the major dose-limiting side effect in anticancer chemotherapy. These approaches have the potential to substantially reduce the incidence of life-threatening grade 4 and subtherapeutic grade 0 neutropenia compared to existing approaches. We further show that RL allows to gain further insights by identifying patient factors that drive dose decisions. Due to its flexibility, the proposed combined DA-RL approach can easily be extended to integrate multiple endpoints or patient-reported outcomes, thereby promising important benefits for future personalized therapies.


Robust Reinforcement Learning with Wasserstein Constraint

arXiv.org Machine Learning

Robust Reinforcement Learning aims to find the optimal policy with some extent of robustness to environmental dynamics. Existing learning algorithms usually enable the robustness through disturbing the current state or simulating environmental parameters in a heuristic way, which lack quantified robustness to the system dynamics (i.e. transition probability). To overcome this issue, we leverage Wasserstein distance to measure the disturbance to the reference transition kernel. With Wasserstein distance, we are able to connect transition kernel disturbance to the state disturbance, i.e. reduce an infinite-dimensional optimization problem to a finite-dimensional risk-aware problem. Through the derived risk-aware optimal Bellman equation, we show the existence of optimal robust policies, provide a sensitivity analysis for the perturbations, and then design a novel robust learning algorithm--Wasserstein Robust Advantage Actor-Critic algorithm (WRAAC). The effectiveness of the proposed algorithm is verified in the Cart-Pole environment.


Adversarial Attacks on Reinforcement Learning based Energy Management Systems of Extended Range Electric Delivery Vehicles

arXiv.org Machine Learning

Adversarial examples are firstly investigated in the area of computer vision: by adding some carefully designed ''noise'' to the original input image, the perturbed image that cannot be distinguished from the original one by human, can fool a well-trained classifier easily. In recent years, researchers also demonstrated that adversarial examples can mislead deep reinforcement learning (DRL) agents on playing video games using image inputs with similar methods. However, although DRL has been more and more popular in the area of intelligent transportation systems, there is little research investigating the impacts of adversarial attacks on them, especially for algorithms that do not take images as inputs. In this work, we investigated several fast methods to generate adversarial examples to significantly degrade the performance of a well-trained DRL- based energy management system of an extended range electric delivery vehicle. The perturbed inputs are low-dimensional state representations and close to the original inputs quantified by different kinds of norms. Our work shows that, to apply DRL agents on real-world transportation systems, adversarial examples in the form of cyber-attack should be considered carefully, especially for applications that may lead to serious safety issues.


Deep R-Learning for Continual Area Sweeping

arXiv.org Machine Learning

Coverage path planning is a well-studied problem in robotics in which a robot must plan a path that passes through every point in a given area repeatedly, usually with a uniform frequency. To address the scenario in which some points need to be visited more frequently than others, this problem has been extended to non-uniform coverage planning. This paper considers the variant of non-uniform coverage in which the robot does not know the distribution of relevant events beforehand and must nevertheless learn to maximize the rate of detecting events of interest. This continual area sweeping problem has been previously formalized in a way that makes strong assumptions about the environment, and to date only a greedy approach has been proposed. We generalize the continual area sweeping formulation to include fewer environmental constraints, and propose a novel approach based on reinforcement learning in a Semi-Markov Decision Process. This approach is evaluated in an abstract simulation and in a high fidelity Gazebo simulation. These evaluations show significant improvement upon the existing approach in general settings, which is especially relevant in the growing area of service robotics.


Modern Reinforcement Learning: Actor-Critic Methods

#artificialintelligence

In this advanced course on deep reinforcement learning, you will learn how to implement policy gradient, actor critic, deep deterministic policy gradient (DDPG), and twin delayed deep deterministic policy gradient (TD3) algorithms in a variety of challenging environments from the Open AI gym. From there we will progress to teaching an agent to balance the cart pole using Q learning. After mastering the fundamentals, the pace quickens, and we move straight into an introduction to policy gradient methods. We cover the REINFORCE algorithm, and use it to teach an artificial intelligence to land on the moon in the lunar lander environment from the Open AI gym. Next we progress to coding up the one step actor critic algorithm, to again beat the lunar lander.