Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective Reinforcement Learning

Neural Information Processing Systems 

Let = be the optimal policy of the CMDP problem in (9). Theorem 3. For any K ≥ 1, take uniform policy π_0, 0 16 , 6 (1 )3, = 1 , and t_k = ⌈ 11 log (5LK6 log (|A|)) + 1 ⌉.