
Collaborating Authors: Faisal, Aldo


Bag of Policies for Distributional Deep Exploration

arXiv.org Artificial Intelligence

Efficient exploration in complex environments remains a major challenge for reinforcement learning (RL). Compared to previous Thompson sampling-inspired mechanisms that enable temporally extended exploration, i.e., deep exploration, we focus on deep exploration in distributional RL. We develop here a general purpose approach, Bag of Policies (BoP), that can be built on top of any return distribution estimator by maintaining a population of its copies. BoP consists of an ensemble of multiple heads that are updated independently. During training, each episode is controlled by only one of the heads and the collected state-action pairs are used to update all heads off-policy, leading to distinct learning signals for each head which diversify …

Distributional RL (DiRL) has rapidly established its place among reinforcement learning (RL) algorithms Bellemare et al. [2017] as a powerful improvement over non-distributional value-based counterparts Lyle et al. [2019]. In DiRL, the agent does not learn a single summary statistic of the return for each state-action pair, but instead learns the whole return distribution. The agent's behaviour is being evaluated for multiple possible consequences which in turn affect the policy update. While this does lead to more stable learning and better performance Lyle et al. [2019], it does not itself change the way actions are selected; as distributional extensions to value-based RL, in C51 Bellemare et al. [2017] and QR-DQN Dabney et al. [2018b] the agent still takes actions according to the mean of the estimated return distributions in each state-action pair.
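The training loop described above (one ensemble head acting per episode, all heads updated off-policy from the shared transitions) can be illustrated with a small tabular sketch. The toy chain MDP, quantile-based heads, and hyperparameters below are illustrative assumptions, not the authors' implementation:

```python
# Minimal, hedged sketch of the Bag-of-Policies idea: K independent
# return-distribution heads, one randomly chosen head acting per episode,
# and every head updated off-policy from the shared transitions.
import numpy as np

K, N_QUANTILES, N_STATES = 4, 8, 12
GAMMA, LR, EPISODES, MAX_STEPS = 0.99, 0.1, 200, 40
taus = (np.arange(N_QUANTILES) + 0.5) / N_QUANTILES   # quantile midpoints
rng = np.random.default_rng(0)

class ChainEnv:
    """Tiny chain MDP (actions: 0 = stay, 1 = step right; reward at the end)."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = min(self.s + a, N_STATES - 1)
        done = self.s == N_STATES - 1
        return self.s, (1.0 if done else 0.0), done

# theta[k, s, a, :] = estimated return quantiles of head k; small random init
# keeps the heads distinct so their greedy behaviours (and exploration) differ.
theta = rng.normal(0.0, 0.01, size=(K, N_STATES, 2, N_QUANTILES))

def act(head, s):
    """Greedy action under the mean of this head's return distribution."""
    return int(np.argmax(theta[head, s].mean(axis=-1)))

def quantile_update(head, s, a, r, s2, done):
    """One-step quantile-regression (pinball-loss) update for one head."""
    a2 = int(np.argmax(theta[head, s2].mean(axis=-1)))
    target = r + (0.0 if done else GAMMA) * theta[head, s2, a2]
    for j, tau in enumerate(taus):
        u = target - theta[head, s, a, j]
        theta[head, s, a, j] += LR * np.mean(np.where(u > 0, tau, tau - 1.0))

env = ChainEnv()
for episode in range(EPISODES):
    behaviour_head = rng.integers(K)     # a single head controls the whole episode
    s, done, t = env.reset(), False, 0
    while not done and t < MAX_STEPS:
        a = act(behaviour_head, s)
        s2, r, done = env.step(a)
        for k in range(K):               # all heads learn off-policy from the same data
            quantile_update(k, s, a, r, s2, done)
        s, t = s2, t + 1
```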


Action Grammars: A Cognitive Model for Learning Temporal Abstractions

arXiv.org Artificial Intelligence

Hierarchical Reinforcement Learning algorithms have successfully been applied to temporal credit assignment problems with sparse reward signals. However, state-of-the-art algorithms require manual specification of sub-task structures and a sample-inefficient exploration phase, and lack semantic interpretability. Human infants, on the other hand, efficiently detect hierarchical substructures induced by their surroundings. In this work we propose a cognitive-inspired Reinforcement Learning architecture which uses grammar induction to identify sub-goal policies. More specifically, by treating an on-policy trajectory as a sentence sampled from the policy-conditioned language of the environment, we identify hierarchical constituents with the help of unsupervised grammatical inference. The resulting set of temporal abstractions is called action grammars (Pastra & Aloimonos, 2012) and can be used to enable efficient imitation, transfer and online learning.
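The grammar-induction step (treating an action trajectory as a sentence and folding recurring constituents into temporal abstractions) can be approximated with a simple pair-compression routine. The paper relies on unsupervised grammatical inference, so the sketch below is only an illustrative stand-in with hypothetical symbol names:

```python
# Byte-pair-style induction over an action "sentence": repeatedly replace the
# most frequent adjacent pair of symbols with a new production (macro-action).
from collections import Counter

def induce_action_grammar(trajectory, n_rules=3):
    """Return (compressed trajectory, productions) from a list of primitive actions."""
    seq = list(trajectory)
    productions = {}
    for rule_id in range(n_rules):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:               # no repeated structure left to abstract
            break
        symbol = f"G{rule_id}"      # new nonterminal = candidate macro-action
        productions[symbol] = (a, b)
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, productions

# Example: a gridworld-style action string with repeated sub-sequences.
actions = ["up", "right", "up", "right", "pick", "up", "right", "up", "right", "drop"]
compressed, rules = induce_action_grammar(actions)
print(compressed)   # ['G1', 'pick', 'G1', 'drop']
print(rules)        # {'G0': ('up', 'right'), 'G1': ('G0', 'G0')}
```

The extracted productions (G0, G1) can then be offered to the agent as additional macro-actions, which is the role the action grammars play for imitation, transfer and online learning.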


RLOC: Neurobiologically Inspired Hierarchical Reinforcement Learning Algorithm for Continuous Control of Nonlinear Dynamical Systems

arXiv.org Machine Learning

Nonlinear optimal control problems are often solved with numerical methods that require knowledge of the system's dynamics, which may be difficult to infer, and that carry a large computational cost associated with iterative calculations. We present a novel neurobiologically inspired hierarchical learning framework, Reinforcement Learning Optimal Control, which operates on two levels of abstraction and utilises a reduced number of controllers to solve nonlinear systems with unknown dynamics in continuous state and action spaces. Our approach is inspired by research at two levels of abstraction: first, at the level of limb coordination, human behaviour is explained by linear optimal feedback control theory. Second, in cognitive tasks involving learning symbolic-level action selection, humans learn such problems using model-free and model-based reinforcement learning algorithms. We propose that combining these two levels of abstraction leads to a fast global solution of nonlinear control problems using a reduced number of controllers. Our framework learns the local task dynamics from naive experience and forms locally optimal infinite-horizon Linear Quadratic Regulators which produce continuous low-level control. A top-level reinforcement learner uses the controllers as actions and learns how to best combine them in state space while maximising a long-term reward. A single optimal control objective function drives high-level symbolic learning by providing training signals on the desirability of each selected controller. We show that a small number of locally optimal linear controllers are able to solve global nonlinear control problems with unknown dynamics when combined with a reinforcement learner in this hierarchical framework. Our algorithm competes in terms of computational cost and solution quality with sophisticated control algorithms, and we illustrate this with solutions to benchmark problems.
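A rough sketch of the two-level structure (infinite-horizon LQR controllers as low-level actions for a tabular reinforcement learner) might look as follows. The toy double-integrator dynamics, discretisation, and hyperparameters are assumptions for illustration, not the paper's setup:

```python
# Two-level sketch: a few local LQR controllers at the bottom, a tabular
# Q-learner on top that chooses which controller to run in each coarse state.
import numpy as np

def dlqr_gain(A, B, Q, R, iters=200):
    """Infinite-horizon discrete LQR gain via Riccati iteration."""
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

# A few local linearisations (here: identical toy double-integrator models;
# in RLOC these would be fit from naive experience around different regions).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
controllers = [dlqr_gain(A, B, np.diag([1.0, 0.1]), np.array([[0.01]])) for _ in range(3)]
targets = [np.array([0.0, 0.0]), np.array([0.5, 0.0]), np.array([-0.5, 0.0])]

def discretise(x, bins=10):
    """Coarse high-level state for the tabular learner."""
    idx = np.clip(((x + 1.0) / 2.0 * bins).astype(int), 0, bins - 1)
    return idx[0] * bins + idx[1]

n_states, n_options = 100, len(controllers)
Q_table = np.zeros((n_states, n_options))
alpha, gamma, eps = 0.1, 0.95, 0.2
rng = np.random.default_rng(1)

for episode in range(200):
    x = rng.uniform(-1, 1, size=2)
    for t in range(30):
        s = discretise(x)
        o = rng.integers(n_options) if rng.random() < eps else int(np.argmax(Q_table[s]))
        K, x_star = controllers[o], targets[o]
        u = -K @ (x - x_star)                      # continuous low-level LQR control
        x = A @ x + (B @ u).ravel() + rng.normal(0, 0.01, 2)
        r = -float(x @ x)                          # single cost drives high-level learning
        s2 = discretise(x)
        Q_table[s, o] += alpha * (r + gamma * Q_table[s2].max() - Q_table[s, o])
```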


Improving Sepsis Treatment Strategies by Combining Deep and Kernel-Based Reinforcement Learning

arXiv.org Machine Learning

Sepsis is the leading cause of mortality in the ICU. It is challenging to manage because individual patients respond differently to treatment. Thus, tailoring treatment to the individual patient is essential for the best outcomes. In this paper, we take steps toward this goal by applying a mixture-of-experts framework to personalize sepsis treatment. The mixture model selectively alternates between neighbor-based (kernel) and deep reinforcement learning (DRL) experts depending on the patient's current history. On a large retrospective cohort, this mixture-based approach outperforms physician, kernel-only, and DRL-only experts.
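A minimal sketch of the gating idea (use the kernel expert when the current history is close to previously seen patients, otherwise defer to the DRL expert) is shown below. The distance-based switching rule, threshold, and stand-in experts are illustrative assumptions rather than the paper's exact mixture model:

```python
# Hedged sketch of mixture-of-experts gating between a k-NN (kernel) expert
# and a deep RL expert, based on how familiar the current patient history is.
import numpy as np

class KernelExpert:
    """k-nearest-neighbour policy over encoded patient histories."""
    def __init__(self, histories, actions, k=5):
        self.X, self.y, self.k = np.asarray(histories), np.asarray(actions), k
    def nearest_distance(self, h):
        return np.sort(np.linalg.norm(self.X - h, axis=1))[: self.k].mean()
    def act(self, h):
        idx = np.argsort(np.linalg.norm(self.X - h, axis=1))[: self.k]
        return int(np.bincount(self.y[idx]).argmax())   # majority vote of neighbours

def mixture_action(h, kernel_expert, drl_expert, dist_threshold=1.0):
    """Gate between experts based on distance to previously seen histories."""
    if kernel_expert.nearest_distance(h) < dist_threshold:
        return kernel_expert.act(h)                     # enough similar patients seen
    return drl_expert(h)                                # otherwise generalise with DRL

# Toy usage with random encoded histories and a stand-in DRL policy.
rng = np.random.default_rng(0)
kernel = KernelExpert(rng.normal(size=(500, 8)), rng.integers(0, 25, size=500))
drl_policy = lambda h: int(np.argmax(h[:5]))            # placeholder for a trained Q-network
print(mixture_action(rng.normal(size=8), kernel, drl_policy))
```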


Behaviour Policy Estimation in Off-Policy Policy Evaluation: Calibration Matters

arXiv.org Machine Learning

In this work, we consider the problem of estimating a behaviour policy for use in Off-Policy Policy Evaluation (OPE) when the true behaviour policy is unknown. Via a series of empirical studies, we demonstrate that accurate OPE is strongly dependent on the calibration of estimated behaviour policy models: how precisely the behaviour policy is estimated from data. We show how powerful parametric models such as neural networks can result in highly uncalibrated behaviour policy models on a real-world medical dataset, and illustrate how a simple, non-parametric, k-nearest neighbours model produces better calibrated behaviour policy estimates and can be used to obtain superior importance sampling-based OPE estimates.
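The pipeline described above (estimate the behaviour policy from logged data, then plug the estimated probabilities into importance-sampling OPE) can be sketched as follows. The k-NN probability model with light smoothing and the toy data are illustrative assumptions, but they show why calibration of the denominator matters: every importance weight divides by the estimated behaviour probability, so systematic over- or under-confidence biases the estimate.

```python
# Hedged sketch: k-NN behaviour policy estimate feeding an importance-sampling
# OPE estimator. Data shapes and the evaluation policy are toy assumptions.
import numpy as np

def knn_behaviour_probs(states, actions, query, n_actions, k=20):
    """Estimated probability of each action at `query`, from the k nearest logged states."""
    idx = np.argsort(np.linalg.norm(states - query, axis=1))[:k]
    counts = np.bincount(actions[idx], minlength=n_actions).astype(float)
    return (counts + 1e-3) / (counts.sum() + 1e-3 * n_actions)   # light smoothing

def is_estimate(trajectories, eval_policy, states, actions, n_actions):
    """Per-trajectory importance sampling with the estimated behaviour policy."""
    values = []
    for traj in trajectories:                  # traj: list of (state, action, reward)
        w, ret = 1.0, 0.0
        for s, a, r in traj:
            pi_b = knn_behaviour_probs(states, actions, s, n_actions)[a]
            w *= eval_policy(s)[a] / pi_b      # weight depends on calibration of pi_b
            ret += r
        values.append(w * ret)
    return float(np.mean(values))

# Toy usage with random logged data and a uniform evaluation policy.
rng = np.random.default_rng(0)
S = rng.normal(size=(200, 4)); A = rng.integers(0, 3, size=200)
trajs = [[(S[i], A[i], 1.0)] for i in range(0, 200, 20)]
print(is_estimate(trajs, lambda s: np.full(3, 1 / 3), S, A, 3))
```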


Evaluating Reinforcement Learning Algorithms in Observational Health Settings

arXiv.org Machine Learning

Much attention has been devoted recently to the development of machine learning algorithms with the goal of improving treatment policies in healthcare. Reinforcement learning (RL) is a sub-field within machine learning that is concerned with learning how to make sequences of decisions so as to optimize long-term effects. Already, RL algorithms have been proposed to identify decision-making strategies for mechanical ventilation, sepsis management and treatment of schizophrenia. However, before implementing treatment policies learned by black-box algorithms in high-stakes clinical decision problems, special care must be taken in the evaluation of these policies. In this document, our goal is to expose some of the subtleties associated with evaluating RL algorithms in healthcare. We aim to provide a conceptual starting point for clinical and computational researchers to ask the right questions when designing and evaluating algorithms for new ways of treating patients. In the following, we describe how choices about how to summarize a history, variance of statistical estimators, and confounders in more ad-hoc measures can result in unreliable, even misleading estimates of the quality of a treatment policy. We also provide suggestions for mitigating these effects---for while there is much promise for mining observational health data to uncover better treatment policies, evaluation must be performed thoughtfully.
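One of the pitfalls mentioned, the variance of statistical estimators, can be illustrated with a toy importance-sampling experiment. This is not from the paper; the policies and horizons below are arbitrary assumptions, chosen only to show how long treatment histories inflate the spread of OPE estimates:

```python
# Toy demo: cumulative importance-sampling weights become extremely variable
# as the horizon grows, which is one reason OPE on long ICU histories is hard.
import numpy as np

rng = np.random.default_rng(0)
n_trajectories, n_actions = 1000, 4
pi_b = np.full(n_actions, 1.0 / n_actions)      # uniform behaviour policy
pi_e = np.array([0.7, 0.1, 0.1, 0.1])           # evaluation policy prefers action 0

for horizon in (5, 10, 20, 40):
    actions = rng.integers(0, n_actions, size=(n_trajectories, horizon))
    weights = np.prod(pi_e[actions] / pi_b[actions], axis=1)   # cumulative IS weights
    print(horizon, weights.std())               # spread blows up as horizon grows
```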


Representation Balancing MDPs for Off-Policy Policy Evaluation

arXiv.org Artificial Intelligence

We study the problem of off-policy policy evaluation (OPPE) in RL. In contrast to prior work, we consider how to estimate both the individual policy value and the average policy value accurately. We draw inspiration from recent work in causal reasoning, and propose a new finite-sample generalization error bound for value estimates from MDP models. Using this upper bound as an objective, we develop a learning algorithm for an MDP model with a balanced representation, and show that our approach can yield substantially lower MSE in a common synthetic domain and on a challenging real-world sepsis management problem.
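One plausible reading of such an objective (fit an MDP model on a learned representation while penalising the mismatch between the behaviour-policy data distribution and a re-weighting toward the evaluation policy) is sketched below. The network sizes, the mean-embedding style penalty, and the per-sample weights are illustrative assumptions, not the paper's exact bound or algorithm:

```python
# Hedged sketch of a representation-balanced MDP model: transition/reward
# heads on a shared encoder, plus a penalty on the gap between representation
# means under the behaviour data and a re-weighting toward the evaluation policy.
import torch
import torch.nn as nn

class BalancedModel(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.next_state = nn.Linear(hidden + n_actions, state_dim)   # transition head
        self.reward = nn.Linear(hidden + n_actions, 1)               # reward head

    def forward(self, s, a_onehot):
        z = self.encoder(s)
        za = torch.cat([z, a_onehot], dim=-1)
        return self.next_state(za), self.reward(za), z

def balancing_penalty(z, weights):
    """Squared distance between mean representations under the two weightings."""
    w = weights / weights.sum()
    mean_behaviour = z.mean(dim=0)                   # uniform = behaviour distribution
    mean_eval = (w.unsqueeze(1) * z).sum(dim=0)      # re-weighted toward evaluation policy
    return ((mean_behaviour - mean_eval) ** 2).sum()

def loss(model, s, a_onehot, r, s_next, is_weights, lam=1.0):
    pred_next, pred_r, z = model(s, a_onehot)
    model_loss = ((pred_next - s_next) ** 2).mean() + ((pred_r.squeeze(-1) - r) ** 2).mean()
    return model_loss + lam * balancing_penalty(z, is_weights)

# Toy forward pass with illustrative shapes.
model = BalancedModel(state_dim=10, n_actions=4)
s = torch.randn(32, 10); a = torch.eye(4)[torch.randint(0, 4, (32,))]
out = loss(model, s, a, torch.randn(32), torch.randn(32, 10), torch.rand(32))
```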