Convergent Policy Optimization for Safe Reinforcement Learning

Ming Yu, Zhuoran Yang, Mladen Kolar, Zhaoran Wang

Neural Information Processing Systems 

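Given $\theta$, $J^*(\theta)$ and $D^*(\theta)$ are the sample [...] (i.e., a trajectory). Note that $J^*(\theta)$ and $D^*(\theta)$ are random, owing to the randomness [...]. We use $J^*(\theta)$ and $D^*(\theta)$ to denote [...] and a [...]. Clearly, we have $J(\theta) = \mathbb{E}[J^*(\theta)]$ and $D(\theta) = \mathbb{E}[D^*(\theta)]$.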
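
The closing identity, that the population quantities $J$ and $D$ are expectations of their single-trajectory counterparts, can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions and is not the setup of the paper: the environment (a reward of 1 with probability sigmoid(theta) at each step, a cost of 1 otherwise), the discount factor, the horizon, the convention that $J$ is a negative discounted return, and all function names are hypothetical choices made here so that the expectations have a closed form.

```python
import math
import random


def sample_trajectory_values(theta, horizon=20, gamma=0.9, rng=random):
    """Return (j_sample, d_sample) computed from one simulated trajectory.

    Hypothetical toy model: at every step the agent receives reward 1 with
    probability p = sigmoid(theta) and incurs cost 1 otherwise.
    """
    p = 1.0 / (1.0 + math.exp(-theta))
    j_sample, d_sample = 0.0, 0.0
    for t in range(horizon):
        discount = gamma ** t
        if rng.random() < p:
            j_sample -= discount  # assumed convention: J is a negative return
        else:
            d_sample += discount  # discounted cumulative cost
    return j_sample, d_sample


def expected_values(theta, horizon=20, gamma=0.9):
    """Closed-form expectations (J, D) for the toy model above."""
    p = 1.0 / (1.0 + math.exp(-theta))
    geometric_sum = (1.0 - gamma ** horizon) / (1.0 - gamma)
    return -p * geometric_sum, (1.0 - p) * geometric_sum


def monte_carlo_estimate(theta, num_trajectories=20000, seed=0):
    """Average the single-trajectory values over many independent trajectories."""
    rng = random.Random(seed)
    total_j, total_d = 0.0, 0.0
    for _ in range(num_trajectories):
        j_sample, d_sample = sample_trajectory_values(theta, rng=rng)
        total_j += j_sample
        total_d += d_sample
    return total_j / num_trajectories, total_d / num_trajectories


if __name__ == "__main__":
    theta = 0.3
    print("Monte Carlo averages:", monte_carlo_estimate(theta))
    print("Closed-form values:  ", expected_values(theta))
```

A single call to `sample_trajectory_values` returns a noisy value that changes from trajectory to trajectory, while the average over many trajectories settles near the closed-form expectation, which is the sense in which the sample quantities are unbiased estimates of $J$ and $D$.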