Collaborating Authors

Gottesman, Omer


Representation Balancing MDPs for Off-policy Policy Evaluation

Neural Information Processing Systems

We study the problem of off-policy policy evaluation (OPPE) in RL. In contrast to prior work, we consider how to estimate both the individual policy value and average policy value accurately. We draw inspiration from recent work in causal reasoning, and propose a new finite sample generalization error bound for value estimates from MDP models. Using this upper bound as an objective, we develop a learning algorithm for an MDP model with a balanced representation, and show that our approach can yield substantially lower MSE on common synthetic benchmarks and an HIV treatment simulation domain.
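
To make the balanced-representation idea concrete, the sketch below fits a transition and reward model on a learned representation while penalizing mismatch between the logged state distribution and a reweighting toward the evaluation policy. The architecture, the RBF-kernel MMD used in place of the bound's IPM term, and the availability of per-transition importance weights are illustrative assumptions rather than the paper's exact objective.

```python
# Illustrative sketch only: a model-fitting loss with a representation-balancing
# penalty. All names and design choices here are hypothetical.
import torch
import torch.nn as nn

class BalancedMDPModel(nn.Module):
    def __init__(self, state_dim, action_dim, rep_dim=32):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, rep_dim))            # shared representation
        self.dynamics = nn.Linear(rep_dim + action_dim, state_dim)  # next-state head
        self.reward = nn.Linear(rep_dim + action_dim, 1)            # reward head

    def forward(self, s, a):
        z = self.phi(s)
        za = torch.cat([z, a], dim=-1)
        return self.dynamics(za), self.reward(za), z

def rbf_mmd2(x, y, sigma=1.0):
    """Squared maximum mean discrepancy with an RBF kernel."""
    k = lambda a, b: torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def balanced_loss(model, s, a, r, s_next, iw, alpha=1.0):
    """iw: importance weights pi_e(a|s) / pi_b(a|s) for each logged transition (assumed known)."""
    s_pred, r_pred, z = model(s, a)
    fit = ((s_pred - s_next) ** 2).sum(-1) + (r_pred.squeeze(-1) - r) ** 2
    # Resampling transitions in proportion to iw approximates the evaluation
    # policy's distribution; penalize its mismatch with the logged distribution.
    idx = torch.multinomial(iw, num_samples=len(iw), replacement=True)
    return fit.mean() + alpha * rbf_mmd2(z[idx], z)
```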


Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions

arXiv.org Machine Learning

Off-policy evaluation in reinforcement learning offers the chance of using observational data to improve future outcomes in domains such as healthcare and education, but safe deployment in high-stakes settings requires ways of assessing its validity. Traditional measures such as confidence intervals may be insufficient due to noise, limited data and confounding. In this paper we develop a method that could serve as a hybrid human-AI system, enabling human experts to analyze the validity of policy evaluation estimates. This is accomplished by highlighting observations in the data whose removal will have a large effect on the OPE estimate, and formulating a set of rules for choosing which ones to present to domain experts for validation. We develop methods to compute the influence functions exactly for fitted Q-evaluation with two different function classes: kernel-based and linear least squares. Experiments on medical simulations and real-world intensive care unit data demonstrate that our method can be used to identify limitations in the evaluation process and make evaluation more robust.
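
For the linear least-squares case, the core computation can be sketched as an exact leave-one-out calculation: how much does the value estimate at a state of interest change when a single sample is removed? The single-regression setup and ridge term below are simplifications of full iterative fitted Q-evaluation, and all names are hypothetical.

```python
# Minimal sketch of the "influence of a single transition" idea for a linear
# least-squares value fit. Not the paper's full fitted Q-evaluation procedure.
import numpy as np

def loo_value_changes(Phi, y, phi_eval, lam=1e-3):
    """Exact change in the predicted value at phi_eval when each sample is removed.

    Phi      : (n, d) feature matrix of logged samples
    y        : (n,)   regression targets (e.g. observed returns)
    phi_eval : (d,)   features of the state whose value estimate we care about
    """
    n, d = Phi.shape
    A = Phi.T @ Phi + lam * np.eye(d)
    A_inv = np.linalg.inv(A)
    theta = A_inv @ Phi.T @ y                      # full-data solution
    v_full = phi_eval @ theta

    changes = np.empty(n)
    for i in range(n):
        x, t = Phi[i], y[i]
        # Sherman-Morrison downdate: drop sample i without refitting from scratch.
        Ax = A_inv @ x
        A_inv_i = A_inv + np.outer(Ax, Ax) / (1.0 - x @ Ax)
        theta_i = A_inv_i @ (Phi.T @ y - x * t)
        changes[i] = phi_eval @ theta_i - v_full   # influence of removing sample i
    return changes
```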


A general method for regularizing tensor decomposition methods via pseudo-data

arXiv.org Machine Learning

Tensor decomposition methods (TDMs) have recently gained popularity as ways of performing inference for latent variable models [Anandkumar et al., 2014]. The interest in these methods is motivated by the fact that they come with theoretical global convergence guarantees in the limit of infinite data [Anandkumar et al., 2012, Arora et al., 2013]. However, a main limitation of these methods is that they lack natural mechanisms for regularization or for encouraging desired properties of the model parameters when the amount of data is limited. Previous works attempted to alleviate this drawback by modifying existing tensor decomposition methods to incorporate specific constraints, such as sparsity [Sun et al., 2015], or to incorporate modeling assumptions, such as the existence of anchor words [Arora et al., 2013, Nguyen et al., 2014]. All of these works develop bespoke algorithms tailored to those constraints or assumptions. Furthermore, many of these methods impose hard constraints on the learned model, which may be detrimental as the size of the data grows: framed in terms of Bayesian intuition, when we have a lot of data we want our methods to allow the evidence to overwhelm our priors. We introduce an alternative approach which can be applied to encourage any (differentiable) desired structure or properties on the model parameters, and which will only encourage this "prior" information when the data is insufficient. Specifically, we adopt the common view of Bayesian priors as representing "pseudo-observations" of artificial data which bias our learned model parameters towards our prior belief [Bishop, 2006]. We apply the tensor decomposition method of Anandkumar et al.
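
The pseudo-observation intuition can be illustrated with a small sketch: blend empirical moments with moments computed from artificial data that encodes the prior structure, weighting the artificial data like a fixed number of extra observations so that it is overwhelmed as the real sample grows. The blending rule and names below are assumptions for illustration, not the paper's exact construction.

```python
# Illustrative sketch of biasing moment estimates with pseudo-data. The same
# blending applies to the higher-order moment tensors used by spectral methods.
import numpy as np

def second_moment(X):
    """Empirical second-moment matrix E[x x^T] of the rows of X."""
    return X.T @ X / len(X)

def blended_moment(X_real, X_pseudo, pseudo_count=100.0):
    """Weight the pseudo-data as if it were `pseudo_count` extra observations."""
    n = len(X_real)
    w = pseudo_count / (n + pseudo_count)   # prior weight shrinks toward 0 as n grows
    return (1 - w) * second_moment(X_real) + w * second_moment(X_pseudo)
```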


Combining Parametric and Nonparametric Models for Off-Policy Evaluation

arXiv.org Machine Learning

We consider a model-based approach to performing batch off-policy evaluation in reinforcement learning. Our method takes a mixture-of-experts approach, combining parametric and non-parametric models of the environment so that the final value estimate has the least expected error. We do so by first estimating the local accuracy of each model and then using a planner to select which model to use at every time step so as to minimize the estimated return error along entire trajectories. Across a variety of domains, our mixture-based approach outperforms the individual models alone as well as state-of-the-art importance-sampling-based estimators.
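
A rough sketch of the per-step selection idea follows, simplified to a greedy rule rather than the planner described above, and with hypothetical interfaces for the models and local-error estimators.

```python
# Sketch only: at each simulated step, roll forward with whichever model is
# predicted to be more accurate in the current region of state space.
import numpy as np

def rollout_value(s0, policy, parametric, nonparametric, error_est, horizon, gamma=0.99):
    """parametric / nonparametric : callables (s, a) -> (s_next, r)
       error_est                  : dict name -> callable (s, a) -> predicted local error
       policy                     : evaluation policy, callable s -> a"""
    s, value, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        # Greedy stand-in for the planner: pick the expert with the smaller
        # estimated local prediction error at this state-action pair.
        if error_est["parametric"](s, a) <= error_est["nonparametric"](s, a):
            s, r = parametric(s, a)
        else:
            s, r = nonparametric(s, a)
        value += discount * r
        discount *= gamma
    return value
```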


Improving Sepsis Treatment Strategies by Combining Deep and Kernel-Based Reinforcement Learning

arXiv.org Machine Learning

Sepsis is the leading cause of mortality in the ICU. It is challenging to manage because individual patients respond differently to treatment; thus, tailoring treatment to the individual patient is essential for the best outcomes. In this paper, we take steps toward this goal by applying a mixture-of-experts framework to personalize sepsis treatment. The mixture model selectively alternates between neighbor-based (kernel) and deep reinforcement learning (DRL) experts depending on the patient's current history. On a large retrospective cohort, this mixture-based approach outperforms the physician, kernel-only, and DRL-only experts.
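
One simple way to picture the gating between experts, sketched below, is a distance-threshold rule: defer to the kernel expert when the encoded patient history has close neighbours in the training cohort, and to the DRL expert otherwise. The gate, encoder, and expert interfaces are illustrative assumptions, not the paper's exact mixture model.

```python
# Hypothetical gating sketch for a kernel / deep RL mixture of experts.
import numpy as np

def select_action(history, encoder, kernel_expert, drl_expert, neighbors, tau=1.0):
    """encoder        : callable mapping a patient history to a feature vector
       neighbors      : (N, d) encoded histories from the training cohort
       kernel_expert /
       drl_expert     : callables mapping a feature vector to an action"""
    z = encoder(history)
    nearest = np.min(np.linalg.norm(neighbors - z, axis=1))
    # Close neighbours exist -> trust the kernel expert's local estimate;
    # otherwise fall back on the deep RL expert's generalization.
    return kernel_expert(z) if nearest < tau else drl_expert(z)
```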


Behaviour Policy Estimation in Off-Policy Policy Evaluation: Calibration Matters

arXiv.org Machine Learning

In this work, we consider the problem of estimating a behaviour policy for use in Off-Policy Policy Evaluation (OPE) when the true behaviour policy is unknown. Via a series of empirical studies, we demonstrate that accurate OPE is strongly dependent on the calibration of the estimated behaviour policy model: how precisely the behaviour policy is estimated from data. We show how powerful parametric models such as neural networks can result in poorly calibrated behaviour policy models on a real-world medical dataset, and illustrate how a simple, non-parametric, k-nearest neighbours model produces better-calibrated behaviour policy estimates and can be used to obtain superior importance-sampling-based OPE estimates.
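
A minimal sketch of the pipeline described above, assuming discrete actions and a Euclidean state space: estimate the behaviour policy probabilities with a k-nearest-neighbours vote and plug them into a per-decision importance-sampling estimate. The estimator details and the clipping are illustrative assumptions.

```python
# Sketch: kNN behaviour-policy estimate feeding per-decision importance sampling.
import numpy as np

def knn_behaviour_prob(s, a, states, actions, k=50):
    """Estimate pi_b(a|s) as the fraction of the k nearest logged states that took action a."""
    idx = np.argsort(np.linalg.norm(states - s, axis=1))[:k]
    return np.clip(np.mean(actions[idx] == a), 1e-3, 1.0)   # clip to avoid zero weights

def is_return(trajectory, pi_e, states, actions, gamma=0.99, k=50):
    """trajectory: list of (s, a, r); pi_e(a, s): evaluation-policy probability of a at s."""
    rho, value = 1.0, 0.0
    for t, (s, a, r) in enumerate(trajectory):
        rho *= pi_e(a, s) / knn_behaviour_prob(s, a, states, actions, k)
        value += (gamma ** t) * rho * r           # per-decision importance sampling
    return value
```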


Evaluating Reinforcement Learning Algorithms in Observational Health Settings

arXiv.org Machine Learning

Much attention has been devoted recently to the development of machine learning algorithms with the goal of improving treatment policies in healthcare. Reinforcement learning (RL) is a sub-field of machine learning concerned with learning how to make sequences of decisions so as to optimize long-term effects. Already, RL algorithms have been proposed to identify decision-making strategies for mechanical ventilation, sepsis management and treatment of schizophrenia. However, before implementing treatment policies learned by black-box algorithms in high-stakes clinical decision problems, special care must be taken in the evaluation of these policies. In this document, our goal is to expose some of the subtleties associated with evaluating RL algorithms in healthcare. We aim to provide a conceptual starting point for clinical and computational researchers to ask the right questions when designing and evaluating algorithms for new ways of treating patients. In the following, we describe how choices about how to summarize a history, the variance of statistical estimators, and confounders in more ad-hoc measures can result in unreliable, even misleading, estimates of the quality of a treatment policy. We also provide suggestions for mitigating these effects: while there is much promise in mining observational health data to uncover better treatment policies, evaluation must be performed thoughtfully.


Representation Balancing MDPs for Off-Policy Policy Evaluation

arXiv.org Artificial Intelligence

We study the problem of off-policy policy evaluation (OPPE) in RL. In contrast to prior work, we consider how to estimate both the individual policy value and average policy value accurately. We draw inspiration from recent work in causal reasoning, and propose a new finite sample generalization error bound for value estimates from MDP models. Using this upper bound as an objective, we develop a learning algorithm for an MDP model with a balanced representation, and show that our approach can yield substantially lower MSE in a common synthetic domain and on a challenging real-world sepsis management problem.