Efficient $Q$-Learning and Actor-Critic Methods for Robust Average Reward Reinforcement Learning
Yang Xu, Swetha Ganesh, Vaneet Aggarwal
We present the first $Q$-learning and actor-critic algorithms for robust average-reward Markov Decision Processes (MDPs) with non-asymptotic convergence guarantees under contamination, total-variation (TV) distance, and Wasserstein distance uncertainty sets. We show that the robust $Q$ Bellman operator is a strict contraction with respect to a carefully constructed semi-norm in which constant functions are quotiented out. This property supports a stochastic approximation update that learns the optimal robust $Q$-function in $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We also show that the same idea applies to robust $Q$-function estimation, which in turn yields the critic. Combining this critic with convergence guarantees for robust policy mirror descent, we obtain a natural actor-critic algorithm that attains an $\epsilon$-optimal robust policy in $\tilde{\mathcal{O}}(\epsilon^{-3})$ samples. These results advance the theory of distributionally robust reinforcement learning in the average-reward setting.
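To make the two ingredients in the abstract concrete, the span semi-norm that quotients out constant functions and a relative stochastic-approximation style update anchored at a reference entry, the snippet below gives a minimal tabular sketch. It assumes a contamination uncertainty set on a toy MDP and uses a synchronous sweep in place of sampled transitions; it is not the paper's algorithm, and the names `span_seminorm`, `robust_target`, `delta`, and `ref` are introduced here for illustration only.

```python
import numpy as np

# Minimal sketch (assumed contamination set, toy MDP); not the paper's algorithm.
rng = np.random.default_rng(0)
nS, nA = 2, 2
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # nominal kernel P[s, a, s']
R = rng.uniform(size=(nS, nA))                 # rewards r(s, a)
delta = 0.1                                    # contamination radius (assumed)

def span_seminorm(q):
    # Semi-norm that quotients out constant functions: sp(q) = max q - min q.
    return q.max() - q.min()

def robust_target(Q, s, a):
    # Worst case over a contamination set: with probability delta the
    # adversary moves all transition mass to the worst next state.
    v = Q.max(axis=1)
    nominal = R[s, a] + P[s, a] @ v
    worst = R[s, a] + v.min()
    return (1.0 - delta) * nominal + delta * worst

Q = np.zeros((nS, nA))
ref = (0, 0)  # reference entry; at convergence Q[ref] tracks the robust gain
for t in range(1, 2001):
    lr = 1.0 / t ** 0.6  # diminishing step size
    for s in range(nS):
        for a in range(nA):
            # Relative (average-reward) Q-learning step: subtracting Q[ref]
            # pins down the additive constant that the semi-norm ignores.
            td = robust_target(Q, s, a) - Q[ref] - Q[s, a]
            Q[s, a] += lr * td

residual = np.array([[robust_target(Q, s, a) - Q[ref] - Q[s, a]
                      for a in range(nA)] for s in range(nS)])
print("relative robust Q:\n", Q)
print("robust gain estimate (Q at reference entry):", Q[ref])
print("span semi-norm of Bellman residual:", span_seminorm(residual))
```

Running the sketch, the span semi-norm of the Bellman residual shrinks toward zero, which is the behavior the contraction property predicts; the reference entry plays the role of the average-reward (gain) estimate.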
Jun-10-2025