Model-Free Robust Average-Reward Reinforcement Learning

Yue Wang, Alvaro Velasquez, George Atia, Ashley Prater-Bennette, Shaofeng Zou

arXiv.org Artificial Intelligence 

Robust Markov decision processes (MDPs) address the challenge of model uncertainty by optimizing the worst-case performance over an uncertainty set of MDPs. In this paper, we focus on robust average-reward MDPs under the model-free setting. We first theoretically characterize the structure of solutions to the robust average-reward Bellman equation, which is essential for our later convergence analysis.

Two performance criteria are commonly used for infinite-horizon MDPs: 1) the discounted-reward setting, where the reward is discounted exponentially with time; and 2) the average-reward setting, where the long-term average reward over time is of interest. For systems that operate for an extended period of time, e.g., queue control, inventory management in supply chains, or communication networks, it is more important to optimize the average reward, since policies obtained from the discounted-reward setting may be myopic and have poor long-term performance (Kazemi ...).
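To make the two criteria and the robust objective concrete, the following is a brief sketch in standard MDP notation; the symbols used here ($\gamma$ for the discount factor, $\pi$ for a policy, $\mathsf{P}$ for a transition kernel, $\mathcal{P}$ for the uncertainty set) are assumptions of this sketch and may differ from the paper's own notation.

% Discounted reward: future rewards are weighted by \gamma^t, with 0 \le \gamma < 1
\[
J^{\pi}_{\mathsf{P},\gamma}(s) \;=\; \mathbb{E}_{\mathsf{P}}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(S_t, A_t) \,\middle|\, S_0 = s, \pi\right],
\]
% Average reward: the long-run reward rate, with no discounting
\[
g^{\pi}_{\mathsf{P}}(s) \;=\; \lim_{T\to\infty} \frac{1}{T}\, \mathbb{E}_{\mathsf{P}}\!\left[\sum_{t=0}^{T-1} r(S_t, A_t) \,\middle|\, S_0 = s, \pi\right],
\]
% Robust average reward: optimize the worst case over the uncertainty set \mathcal{P}
\[
\max_{\pi}\ \min_{\mathsf{P} \in \mathcal{P}}\ g^{\pi}_{\mathsf{P}}(s).
\]

For a fixed discount factor bounded away from 1, the discounted-optimal policy need not coincide with the average-reward-optimal one, which is the sense in which discounted policies can be myopic for long-running systems.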
