Average-reward reinforcement learning in semi-Markov decision processes via relative value iteration
Yu, Huizhen, Wan, Yi, Sutton, Richard S.
arXiv.org Artificial Intelligence
This paper applies the authors' recent results on asynchronous stochastic approximation (SA) in the Borkar-Meyn framework to reinforcement learning in average-reward semi-Markov decision processes (SMDPs). We establish the convergence of an asynchronous SA analogue of Schweitzer's classical relative value iteration algorithm, RVI Q-learning, for finite-space, weakly communicating SMDPs. In particular, we show that the algorithm converges almost surely to a compact, connected subset of solutions to the average-reward optimality equation, with convergence to a unique, sample path-dependent solution under additional stepsize and asynchrony conditions. Moreover, to make full use of the SA framework, we introduce new monotonicity conditions for estimating the optimal reward rate in RVI Q-learning. These conditions substantially expand the previously considered algorithmic framework and are addressed through novel arguments in the stability and convergence analysis of RVI Q-learning.
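The RVI Q-learning scheme discussed above can be illustrated with a minimal tabular sketch. Everything below (the toy SMDP's transitions, rewards, holding times, the stepsize, and the choice of reward-rate estimator `f(Q) = max Q`) is an illustrative assumption, not the paper's specification; the key structural feature is the SMDP-style update, which subtracts the estimated optimal reward rate scaled by the random holding time.

```python
import numpy as np

# Hedged sketch of tabular RVI Q-learning on a toy average-reward SMDP.
# The model below is invented for illustration only.
rng = np.random.default_rng(0)

n_states, n_actions = 3, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition probs
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # mean rewards
tau = rng.uniform(0.5, 2.0, size=(n_states, n_actions))           # mean holding times

Q = np.zeros((n_states, n_actions))
alpha = 0.05                 # constant stepsize (assumption; the paper uses SA stepsize conditions)
f = lambda Q: Q.max()        # one common choice of reward-rate estimator f(Q)

s = 0
for t in range(50_000):
    # epsilon-greedy behavior policy (assumption)
    a = rng.integers(n_actions) if rng.random() < 0.1 else int(Q[s].argmax())
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a] + 0.1 * rng.standard_normal()   # noisy transition reward
    dt = rng.exponential(tau[s, a])             # random holding time
    # SMDP-style RVI Q-learning update: the estimated optimal reward
    # rate f(Q) is subtracted in proportion to the holding time dt.
    Q[s, a] += alpha * (r - f(Q) * dt + Q[s_next].max() - Q[s, a])
    s = s_next

rho_hat = f(Q)  # resulting estimate of the optimal reward rate
```

With rewards in [0, 1] and mean holding times of at least 0.5, the estimated reward rate stays in a bounded range; the paper's contribution is precisely the convergence and stability analysis justifying schemes of this shape under much more general conditions on `f`.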
Dec-9-2025