Robust Batch Policy Learning in Markov Decision Processes

Nov-10-2020–arXiv.org Machine Learning

One important goal in sequential decision-making problems is to construct a policy that maximizes the average reward over a certain amount of time. Depending on the application, the length of time over which the learned policy will be used in the future (i.e., the planning horizon) is often unknown and can differ from the horizon considered during policy optimization. In addition, the performance measure used in learning the policy often depends on the choice of the initial state distribution. It is therefore of great interest to learn a policy with strong generalizability and adaptivity. Given pre-collected data consisting of multiple trajectories of states, actions, and rewards, our goal is to learn a policy that is robust in the sense that it guarantees uniform performance over the unknown planning horizon and under distributional changes in the initial state.

Country:
- North America > United States
  - Massachusetts > Middlesex County > Belmont (0.04)
- Europe > United Kingdom
  - England > Cambridgeshire > Cambridge (0.04)

Genre:
- Research Report (1.00)

Industry:
- Health & Medicine (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Optimization (1.00)
  - Machine Learning
    - Statistical Learning (1.00)
    - Reinforcement Learning (1.00)
    - Learning Graphical Models > Undirected Networks
      - Markov Models (0.50)
