Robust Batch Policy Learning in Markov Decision Processes
One important goal in sequential decision-making problems is to construct a policy that maximizes the average reward over a certain amount of time. Depending on the application, the duration over which the learned policy will be used in the future (i.e., the planning horizon) is often unknown and can differ from the one considered during policy optimization. In addition, the performance measure used in learning the policy often depends on the choice of the initial state distribution. It is therefore of great interest to learn a policy with strong generalizability and adaptivity. Given pre-collected data of multiple trajectories consisting of states, actions, and rewards, our goal is to learn a policy that is robust in the sense that it guarantees uniform performance over the unknown planning horizon and under distributional change in the initial state.
Nov-10-2020
- Country:
- North America > United States
- Massachusetts > Middlesex County > Belmont (0.04)
- Europe > United Kingdom
- England > Cambridgeshire > Cambridge (0.04)
- Genre:
- Research Report (1.00)
- Industry:
- Health & Medicine (1.00)