Reviews: A Lyapunov-based Approach to Safe Reinforcement Learning

Neural Information Processing Systems 

The focus is safe reinforcement learning under the constrained Markov decision process (CMDP) framework, where safety is expressed as policy-dependent constraints. Two key assumptions are that (i) we have access to a safe baseline policy, and (ii) this baseline policy is close enough to the unknown optimal policy in total variation distance (Assumption 1 in the paper). A key technical insight is to augment the constraint cost with a cost-shaping function so that the resulting constraint value function becomes a Lyapunov function with respect to the baseline policy. Just as identifying a Lyapunov function is nontrivial in general, this cost-shaping function is difficult to compute exactly, so the authors propose several approximations, including solving an LP at each iteration of their policy iteration and value iteration procedures.
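To make the Lyapunov construction concrete, here is a minimal sketch on a made-up 3-state CMDP under a fixed baseline policy. All numbers (transition matrix `P_b`, constraint cost `d`, budget `d0`) are illustrative assumptions; the shaping term is taken as a constant `eps` for simplicity, whereas the paper derives a state-dependent one via an LP. The sketch checks the defining Lyapunov property: applying the baseline policy's Bellman operator for the unshaped cost does not increase the candidate function.

```python
import numpy as np

# Tiny 3-state CMDP under a fixed baseline policy (illustrative numbers,
# not from the paper).
gamma = 0.9
P_b = np.array([[0.8, 0.2, 0.0],   # transitions induced by the baseline policy
                [0.1, 0.7, 0.2],
                [0.0, 0.3, 0.7]])
d = np.array([0.0, 0.5, 1.0])      # immediate constraint cost d(x)
d0 = 5.0                           # constraint budget at the initial state x0 = 0
eps = 0.1                          # constant cost-shaping term (a simplification;
                                   # the paper computes a state-dependent one)

# Lyapunov candidate: discounted value of the shaped cost d + eps under the
# baseline policy, i.e. the solution of L = (d + eps) + gamma * P_b @ L.
L = np.linalg.solve(np.eye(3) - gamma * P_b, d + eps)

# Lyapunov property w.r.t. the baseline policy: the Bellman operator for the
# unshaped cost d must not increase L (here T_b L = L - eps <= L since eps >= 0).
T_b_L = d + gamma * P_b @ L
assert np.all(T_b_L <= L + 1e-9)

# Feasibility at the initial state: L bounds the baseline policy's constraint
# value from above, so L[0] <= d0 certifies the baseline policy is safe.
assert L[0] <= d0
```

The safe-policy-improvement step then searches for policies whose Bellman operator also keeps `L` non-increasing, which guarantees the constraint stays satisfied during learning.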