Efficient $Q$-Learning and Actor-Critic Methods for Robust Average Reward Reinforcement Learning
Yang Xu, Swetha Ganesh, Vaneet Aggarwal
We present the first $Q$-learning and actor-critic algorithms for robust average-reward Markov Decision Processes (MDPs) with non-asymptotic convergence guarantees under contamination, total-variation (TV) distance, and Wasserstein distance uncertainty sets. We show that the robust $Q$ Bellman operator is a strict contraction with respect to a carefully constructed semi-norm in which constant functions are quotiented out. This property supports a stochastic approximation update that learns the optimal robust $Q$-function in $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We also show that the same idea applies to robust $Q$-function estimation, which in turn yields the critic. Combining this critic with convergence guarantees for robust policy mirror descent, we obtain a natural actor-critic algorithm that attains an $\epsilon$-optimal robust policy in $\tilde{\mathcal{O}}(\epsilon^{-3})$ samples. These results advance the theory of distributionally robust reinforcement learning in the average-reward setting.
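To make the two ingredients in the abstract concrete, the span semi-norm that quotients out constant functions and a relative stochastic-approximation style update anchored at a reference entry, the snippet below gives a minimal tabular sketch. It assumes a contamination uncertainty set on a toy MDP and uses a synchronous sweep in place of sampled transitions; it is not the paper's algorithm, and the names `span_seminorm`, `robust_target`, `delta`, and `ref` are introduced here for illustration only.

```python
import numpy as np

# Minimal sketch (assumed contamination set, toy MDP); not the paper's algorithm.
rng = np.random.default_rng(0)
nS, nA = 2, 2
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # nominal kernel P[s, a, s']
R = rng.uniform(size=(nS, nA))                 # rewards r(s, a)
delta = 0.1                                    # contamination radius (assumed)

def span_seminorm(q):
    # Semi-norm that quotients out constant functions: sp(q) = max q - min q.
    return q.max() - q.min()

def robust_target(Q, s, a):
    # Worst case over a contamination set: with probability delta the
    # adversary moves all transition mass to the worst next state.
    v = Q.max(axis=1)
    nominal = R[s, a] + P[s, a] @ v
    worst = R[s, a] + v.min()
    return (1.0 - delta) * nominal + delta * worst

Q = np.zeros((nS, nA))
ref = (0, 0)  # reference entry; at convergence Q[ref] tracks the robust gain
for t in range(1, 2001):
    lr = 1.0 / t ** 0.6  # diminishing step size
    for s in range(nS):
        for a in range(nA):
            # Relative (average-reward) Q-learning step: subtracting Q[ref]
            # pins down the additive constant that the semi-norm ignores.
            td = robust_target(Q, s, a) - Q[ref] - Q[s, a]
            Q[s, a] += lr * td

residual = np.array([[robust_target(Q, s, a) - Q[ref] - Q[s, a]
                      for a in range(nA)] for s in range(nS)])
print("relative robust Q:\n", Q)
print("robust gain estimate (Q at reference entry):", Q[ref])
print("span semi-norm of Bellman residual:", span_seminorm(residual))
```

Running the sketch, the span semi-norm of the Bellman residual shrinks toward zero, which is the behavior the contraction property predicts; the reference entry plays the role of the average-reward (gain) estimate.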
Jun-10-2025