Reviews: Safe Policy Improvement by Minimizing Robust Baseline Regret

Neural Information Processing Systems 

The paper considers the problem of robust MDP. In particular, it considers that we have an uncertain model of transition probabilities of the MDP (with the rectangular uncertainty) as well as a baseline policy. The goal is to find a new policy such that it is guaranteed to be no worse than the baseline. This is called the safe policy improvement. A robust approach would find a policy that maximizes the worst case performance.