Model-Free Robust Average-Reward Reinforcement Learning

Yue Wang, Alvaro Velasquez, George Atia, Ashley Prater-Bennette, Shaofeng Zou

arXiv.org Artificial Intelligence 

Robust Markov decision processes (MDPs) address the challenge of model uncertainty by optimizing the worst-case performance over an uncertainty set of MDPs. In this paper, we focus on robust average-reward MDPs under the model-free setting. We first theoretically characterize the structure of solutions to the robust average-reward Bellman equation, which is essential for our later convergence analysis.

Two performance criteria are commonly used for infinite-horizon MDPs: 1) the discounted-reward setting, where the reward is discounted exponentially with time; and 2) the average-reward setting, where the long-term average reward over time is of interest. For systems that operate for an extended period of time, e.g., queue control, inventory management in supply chains, or communication networks, it is more important to optimize the average reward, since policies obtained from the discounted-reward setting may be myopic and have poor long-term performance (Kazemi ...).
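To make the two criteria and the robust objective concrete, the following is a brief sketch in standard MDP notation; the symbols used here ($\gamma$ for the discount factor, $\pi$ for a policy, $\mathsf{P}$ for a transition kernel, $\mathcal{P}$ for the uncertainty set) are assumptions of this sketch and may differ from the paper's own notation.

% Discounted reward: future rewards are weighted by \gamma^t, with 0 \le \gamma < 1
\[
J^{\pi}_{\mathsf{P},\gamma}(s) \;=\; \mathbb{E}_{\mathsf{P}}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(S_t, A_t) \,\middle|\, S_0 = s, \pi\right],
\]
% Average reward: the long-run reward rate, with no discounting
\[
g^{\pi}_{\mathsf{P}}(s) \;=\; \lim_{T\to\infty} \frac{1}{T}\, \mathbb{E}_{\mathsf{P}}\!\left[\sum_{t=0}^{T-1} r(S_t, A_t) \,\middle|\, S_0 = s, \pi\right],
\]
% Robust average reward: optimize the worst case over the uncertainty set \mathcal{P}
\[
\max_{\pi}\ \min_{\mathsf{P} \in \mathcal{P}}\ g^{\pi}_{\mathsf{P}}(s).
\]

For a fixed discount factor bounded away from 1, the discounted-optimal policy need not coincide with the average-reward-optimal one, which is the sense in which discounted policies can be myopic for long-running systems.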
