Batch Policy Learning in Average Reward Markov Decision Processes
Peng Liao, Zhengling Qi, Susan Murphy
We study the problem of policy optimization in Markov Decision Processes (MDPs) over an infinite time horizon (Puterman, 1994). We focus on the batch (i.e., offline) setting, where historical data consisting of multiple trajectories has previously been collected under some behavior policy. Our goal is to learn a new policy with guaranteed performance when implemented in the future. In this work, we develop a data-efficient method that learns, from a training set of multiple trajectories, the policy in a pre-specified policy class that optimizes the long-term average reward. Furthermore, we establish a finite-sample regret guarantee, where the regret is the difference between the average reward of the optimal policy in the class and the average reward of the policy estimated by our method. This work is motivated by the development of just-in-time adaptive interventions in mobile health (mHealth) applications (Nahum-Shani et al., 2017). Our method can be used to learn a treatment policy that maps real-time information about an individual's status and context to a particular treatment at each of many decision times, in order to support health behaviors.
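The objective and regret described above can be written out formally. The following is a sketch under standard average-reward MDP assumptions (the symbols $\eta$, $\Pi$, and $\hat\pi$ are notational choices for illustration, not taken from the paper itself):

```latex
% Long-term average reward of a policy \pi, starting from state s:
%   \eta^{\pi} = \lim_{T \to \infty} \frac{1}{T}
%                \mathbb{E}^{\pi}\Big[ \sum_{t=1}^{T} R_t \,\Big|\, S_1 = s \Big],
% which, under suitable ergodicity conditions, does not depend on s.
%
% Given a pre-specified policy class \Pi and an estimated policy \hat\pi,
% the (finite-sample) regret bounded by the guarantee is
%   \mathrm{Regret}(\hat\pi) = \max_{\pi \in \Pi} \eta^{\pi} - \eta^{\hat\pi}.
\[
  \eta^{\pi} = \lim_{T \to \infty} \frac{1}{T}\,
  \mathbb{E}^{\pi}\!\Big[ \textstyle\sum_{t=1}^{T} R_t \Big],
  \qquad
  \mathrm{Regret}(\hat\pi) = \max_{\pi \in \Pi} \eta^{\pi} - \eta^{\hat\pi}.
\]
```

In the batch setting, $\hat\pi$ must be estimated solely from trajectories collected under the behavior policy, which is what distinguishes this problem from online average-reward RL.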
Jul-22-2020