Off-Policy Policy Gradient with State Distribution Correction

Yao Liu, Adith Swaminathan, Alekh Agarwal, Emma Brunskill

arXiv.org Artificial Intelligence 

The ability to use data about prior decisions and their outcomes to make counterfactual inferences about how alternative decision policies might perform is a cornerstone of intelligent behavior. It also has immense practical potential: it can enable the use of electronic medical record data to infer better treatment decisions for patients, the use of prior product recommendations to inform more effective recommendation strategies, and the use of previously collected data from students using educational software to better teach those and future students. Such counterfactual reasoning is particularly important when deriving decision policies that will be used to make not one but a sequence of decisions, since online sampling during a learning procedure is both costly and dangerous, and impractical in many of the applications above. While amply motivated, such counterfactual reasoning is also challenging because the data is censored: we can only observe the result of providing a particular chemotherapy treatment policy to a particular patient, not the counterfactual outcome had we instead started with a radiation sequence. We focus on the problem of performing such counterfactual inferences in the context of sequential decision making in a Markov decision process (MDP).
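The abstract does not spell out the estimator, but the title suggests an off-policy policy gradient in which each logged sample is reweighted by a state distribution correction. A minimal sketch of that idea, under assumptions of my own (a tabular two-state, two-action problem, a softmax target policy, a uniform behavior policy `mu`, and made-up state-distribution ratios `w` standing in for d_pi/d_mu), might look like:

```python
import math

# Hypothetical toy setup: logged (state, action, reward) samples were
# collected under a behavior policy mu. We estimate the policy gradient
# for a softmax target policy pi_theta, weighting each sample by
# w(s) = d_pi(s) / d_mu(s)  (the state-distribution correction)
# times the action likelihood ratio pi(a|s) / mu(a|s).

theta = [[0.5, -0.5], [0.0, 0.0]]   # target policy parameters (assumed)
mu = [[0.5, 0.5], [0.5, 0.5]]       # behavior policy probabilities (assumed)
w = [1.2, 0.8]                      # assumed d_pi/d_mu ratios per state

def pi(s, a):
    """Softmax target policy probability pi_theta(a | s)."""
    z = sum(math.exp(t) for t in theta[s])
    return math.exp(theta[s][a]) / z

def grad_log_pi(s, a):
    """d/dtheta[s][b] of log pi(a|s) for a softmax policy: 1{a=b} - pi(b|s)."""
    return [(1.0 if b == a else 0.0) - pi(s, b) for b in range(2)]

# Logged transitions (state, action, reward) collected under mu (made up).
data = [(0, 0, 1.0), (0, 1, 0.0), (1, 0, 0.5), (1, 1, 1.0)]

grad = [[0.0, 0.0], [0.0, 0.0]]
for s, a, r in data:
    rho = w[s] * pi(s, a) / mu[s][a]        # full correction weight
    g = grad_log_pi(s, a)
    for b in range(2):
        grad[s][b] += rho * r * g[b] / len(data)
```

The point of the `w` factor is that a plain per-action importance ratio leaves the expectation taken under the behavior policy's state distribution; multiplying by d_pi/d_mu corrects the mismatch. Here `w` is simply assumed; estimating it from data is the hard part the paper addresses.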
