Policy Optimization for Robust Average Reward MDPs
Neural Information Processing Systems
This paper studies first-order policy optimization for robust average-cost Markov decision processes (MDPs), focusing on ergodic Markov chains. In the robust average-cost setting, the goal is to optimize the worst-case average cost over an uncertainty set of transition kernels. We first develop a sub-gradient of the robust average cost, and based on this sub-gradient we propose a robust policy mirror descent approach.
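The approach described above can be illustrated with a minimal numerical sketch. The code below is not the paper's algorithm: it assumes a tabular policy, a finite (two-kernel) uncertainty set, and approximates the sub-gradient of the worst-case average cost by finite differences rather than the paper's sub-gradient formula; the mirror descent step uses the standard KL (exponentiated-gradient) update. The toy kernels and costs are hypothetical.

```python
import numpy as np

def avg_cost(policy, P, c):
    """Average cost of `policy` under transition kernel P (nS x nA x nS)
    with per-step costs c (nS x nA), assuming the induced chain is ergodic."""
    Ppi = np.einsum('sa,saj->sj', policy, P)          # induced Markov chain
    w, v = np.linalg.eig(Ppi.T)                        # stationary distribution:
    mu = np.real(v[:, np.argmax(np.real(w))])          # eigenvector for eigenvalue 1
    mu = np.abs(mu) / np.abs(mu).sum()
    return float(mu @ np.einsum('sa,sa->s', policy, c))

def worst_case_cost(policy, kernels, c):
    """Robust average cost: worst case over a finite uncertainty set."""
    return max(avg_cost(policy, P, c) for P in kernels)

def robust_pmd(kernels, c, eta=0.5, iters=200, eps=1e-5):
    """Robust policy mirror descent sketch with a finite-difference sub-gradient."""
    nS, nA = c.shape
    policy = np.full((nS, nA), 1.0 / nA)               # start from the uniform policy
    for _ in range(iters):
        base = worst_case_cost(policy, kernels, c)
        g = np.zeros_like(policy)                      # approximate sub-gradient
        for s in range(nS):
            for a in range(nA):
                pert = policy.copy()
                pert[s, a] += eps
                pert[s] /= pert[s].sum()               # stay on the simplex
                g[s, a] = (worst_case_cost(pert, kernels, c) - base) / eps
        # KL mirror descent step: multiplicative update toward lower cost
        policy = policy * np.exp(-eta * g)
        policy /= policy.sum(axis=1, keepdims=True)
    return policy

# Toy 2-state, 2-action example (hypothetical numbers): action 1 is cost-free,
# and the uncertainty set contains two strictly positive (hence ergodic) kernels.
P1 = np.array([[[0.7, 0.3], [0.4, 0.6]],
               [[0.5, 0.5], [0.2, 0.8]]])
P2 = np.array([[[0.6, 0.4], [0.3, 0.7]],
               [[0.55, 0.45], [0.25, 0.75]]])
c = np.array([[1.0, 0.0],
              [1.0, 0.0]])
pol = robust_pmd([P1, P2], c)
```

On this toy problem the update concentrates the policy on the cost-free action, driving the worst-case average cost toward zero.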
May-26-2025, 18:13:49 GMT