Safe Policy Improvement by Minimizing Robust Baseline Regret

Petrik, Marek; Chow, Yinlam; Ghavamzadeh, Mohammad

arXiv.org Machine Learning 

Many problems in science and engineering can be formulated as sequential decision-making problems under uncertainty. A common scenario in fields such as online marketing, inventory control, health informatics, and computational finance is to find a good or optimal strategy (policy) given a batch of data generated by the current strategy of the company (hospital, investor). Although there are many techniques for finding a good policy from a batch of data, only a few of them guarantee that the obtained policy will perform well when it is deployed. Since deploying an untested policy can be risky for the business, the product (hospital, investment) manager does not usually allow it unless we provide some performance guarantee for the obtained strategy relative to the baseline policy (e.g., the policy that is currently in use).

In this paper, we focus on the model-based approach to this fundamental problem in the context of infinite-horizon discounted Markov decision processes (MDPs). In this approach, we use the batch of data to build a model or simulator that approximates the true behavior of the dynamical system, together with an error function that captures the accuracy of the model at each state of the system. Our goal is to use the simulator and error function to compute a safe policy, i.e., a policy that is guaranteed to perform at least as well as the baseline strategy.

Most of the work on this topic has been in the model-free setting, where safe policies are computed directly from the batch of data without building an explicit model of the system [12, 13]. Another class of model-free algorithms uses a batch of data generated by the current policy and returns a policy that is guaranteed to perform better.
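As a rough illustration of the model-based setup described above, the following minimal sketch builds an empirical transition model from a batch of `(state, action, next_state)` samples, together with a per-state-action error function. The function name `build_model`, the confidence parameter `delta`, the L1 concentration bound used for the error, and the fallback for unvisited pairs are all assumptions made for this example; the text above does not specify how the model or the error function is constructed.

```python
import math
from collections import defaultdict


def build_model(batch, n_states, n_actions, delta=0.1):
    """Estimate transition probabilities and an L1 error bound per (state, action).

    batch: iterable of (s, a, s_next) integer triples (hypothetical data format).
    delta: assumed overall failure probability for the error bounds.
    """
    # Count observed transitions for each (state, action) pair.
    counts = defaultdict(lambda: [0] * n_states)
    for s, a, s_next in batch:
        counts[(s, a)][s_next] += 1

    model, error = {}, {}
    for s in range(n_states):
        for a in range(n_actions):
            c = counts[(s, a)]
            n = sum(c)
            if n == 0:
                # Assumption: with no data, use a uniform guess and the
                # largest possible L1 distance between two distributions.
                model[(s, a)] = [1.0 / n_states] * n_states
                error[(s, a)] = 2.0
            else:
                model[(s, a)] = [x / n for x in c]
                # Assumed concentration-style bound on the L1 error,
                # with a union bound over all (state, action) pairs.
                log_term = math.log(n_states * n_actions / delta) + n_states * math.log(2)
                error[(s, a)] = min(2.0, math.sqrt(2.0 * log_term / n))
    return model, error
```

In this sketch, state-action pairs that appear often in the batch receive a small error value, while rarely visited pairs receive a large one. A safe policy computation would then have to be cautious wherever the error is large, for example by staying close to the baseline policy in those states.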
