We consider reinforcement learning in constrained Markov decision processes (CMDPs), which play a central role in satisfying safety or resource constraints in sequential learning and decision-making.
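For concreteness, one common way to write a CMDP objective is sketched below. The discounted infinite-horizon setting, the single cost signal, and the symbols $r$, $c$, $\gamma$, and $b$ are illustrative assumptions, not notation taken from this text; other formulations (finite-horizon, average-reward, multiple constraints) follow the same pattern.

```latex
% Illustrative CMDP formulation (assumed notation):
%   r = reward function, c = cost function,
%   \gamma \in [0,1) = discount factor, b = constraint budget.
\[
  \max_{\pi} \; J_r(\pi)
    = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right]
  \quad \text{subject to} \quad
  J_c(\pi)
    = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t) \right]
    \le b .
\]
```

Under this reading, a safety or resource constraint corresponds to keeping the expected cumulative cost $J_c(\pi)$ within the budget $b$ while maximizing the expected return $J_r(\pi)$.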
Before deploying any newly developed policy, it is important to assess its impact. In many high-stakes domains, however, deploying such a policy directly for online evaluation is risky or unethical.