Fast Global Convergence of Policy Optimization for Constrained MDPs
Liu, Tao, Zhou, Ruida, Kalathil, Dileep, Kumar, P. R., Tian, Chao
arXiv.org Artificial Intelligence
We address the issue of safety in reinforcement learning. We pose the problem in a discounted infinite-horizon constrained Markov decision process framework. Existing results have shown that gradient-based methods are able to achieve an $\mathcal{O}(1/\sqrt{T})$ global convergence rate both for the optimality gap and the constraint violation. We exhibit a natural policy gradient-based algorithm that has a faster convergence rate $\mathcal{O}(\log(T)/T)$ for both the optimality gap and the constraint violation. When Slater's condition is satisfied and known a priori, zero constraint violation can be further guaranteed for a sufficiently large $T$ while maintaining the same convergence rate.
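For orientation, the setting and the two quantities the rates refer to can be written out as below. This is a sketch in standard constrained-MDP notation, not necessarily the paper's own: the symbols $r$, $g$, $b$, $\rho$, $\gamma$, $\pi^\star$ and $\bar{\pi}_T$ are assumptions introduced here, as is the sign convention that the constraint is a lower bound on a utility value. How the algorithm forms its output policy $\bar{\pi}_T$ after $T$ iterations is not stated in the abstract.

$$
\max_{\pi} \; V_r^{\pi}(\rho) \quad \text{s.t.} \quad V_g^{\pi}(\rho) \ge b,
\qquad
V_{\diamond}^{\pi}(\rho) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, \diamond(s_t, a_t) \,\Big|\, s_0 \sim \rho,\ \pi\Big],
\quad \diamond \in \{r, g\},\ \gamma \in [0, 1).
$$

With $\pi^\star$ denoting an optimal feasible policy, the two error measures are

$$
\text{optimality gap: } V_r^{\pi^\star}(\rho) - V_r^{\bar{\pi}_T}(\rho),
\qquad
\text{constraint violation: } \big[\, b - V_g^{\bar{\pi}_T}(\rho) \,\big]_{+}.
$$

In this notation, the claim is that both quantities are $\mathcal{O}(\log(T)/T)$, compared with $\mathcal{O}(1/\sqrt{T})$ in the earlier results. Slater's condition asks for a strictly feasible policy, that is, some $\pi$ with $V_g^{\pi}(\rho) > b$.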
Oct-31-2021
- Country:
- North America > United States
- Texas > Brazos County > College Station (0.04)
- Europe > United Kingdom
- England > Oxfordshire > Oxford (0.04)
- Asia > Middle East
- Jordan (0.04)
- Genre:
- Research Report (0.81)
- Industry:
- Energy > Renewable (0.67)
- Government > Regional Government