Off-Policy Policy Gradient with State Distribution Correction
Liu, Yao, Swaminathan, Adith, Agarwal, Alekh, Brunskill, Emma
arXiv.org Artificial Intelligence
The ability to use data about prior decisions and their outcomes to make counterfactual inferences about how alternative decision policies might perform is a cornerstone of intelligent behavior. It also has immense practical potential: it can enable the use of electronic medical record data to infer better treatment decisions for patients, the use of prior product recommendations to inform more effective strategies for presenting recommendations, and the use of previously collected data from students using educational software to better teach those and future students. Such counterfactual reasoning, particularly when one is deriving decision policies that will be used to make not one but a sequence of decisions, is important because online sampling during a learning procedure is both costly and dangerous, and is not practical in many of the applications above. While amply motivated, such counterfactual reasoning is also challenging because the data is censored: we can only observe the result of providing a particular chemotherapy treatment policy to a particular patient, not the counterfactual of what would have happened had we instead started with a radiation sequence. We focus on the problem of performing such counterfactual inferences in the context of sequential decision making in a Markov decision process (MDP).
Apr-17-2019