RVI-SAC: Average Reward Off-Policy Deep Reinforcement Learning

Yukinari Hisaki, Isao Ono

arXiv.org Artificial Intelligence 

In this paper, we propose an off-policy deep reinforcement learning (DRL) method utilizing the average reward criterion. While most existing DRL methods employ the discounted reward criterion, this can potentially lead to a discrepancy between the training objective and performance metrics in continuing tasks, making the average reward criterion a recommended alternative. We introduce RVI-SAC, an extension of the state-of-the-art off-policy DRL method, Soft Actor-Critic (SAC) (Haarnoja et al., 2018a;b), to the average reward criterion. Our proposal consists of (1) Critic updates based on RVI Q-learning (Abounadi et al., 2001), (2) Actor updates introduced by the average reward soft policy improvement theorem, and (3) automatic adjustment of Reset Cost, enabling the application of average reward reinforcement learning to tasks with termination.

Most existing DRL methods utilize the discounted reward criterion, which is applicable to a variety of MDP-formulated tasks (Puterman, 1994). In particular, for continuing tasks where there is no natural breakpoint in episodes, such as robot locomotion (Todorov et al., 2012) or Access Control Queuing Tasks (Sutton & Barto, 2018), where the interaction between an agent and an environment can continue indefinitely, the discount rate plays a role in keeping the infinite-horizon return bounded. However, discounting introduces an undesirable effect in continuing tasks by prioritizing rewards closer to the current time over those in the future. An approach to mitigate this effect is to bring the discount rate closer to 1, but it is commonly known that a large discount rate can lead to instability and slower convergence (Fujimoto et al., 2018; Dewanto & Gallagher, 2021).
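For reference, here is a minimal sketch of the two criteria discussed above and of the tabular RVI Q-learning update that Critic update (1) builds on, in their standard forms from the cited literature (Puterman, 1994; Abounadi et al., 2001); the entropy-augmented variants actually used in RVI-SAC may differ in detail:

\[ J_\gamma(\pi) = \mathbb{E}_\pi\left[ \sum_{t=0}^{\infty} \gamma^t r_t \right] \quad \text{(discounted criterion)}, \qquad \rho(\pi) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\pi\left[ \sum_{t=0}^{T-1} r_t \right] \quad \text{(average reward criterion)} \]

\[ Q_{t+1}(s, a) \leftarrow Q_t(s, a) + \alpha_t \left( r(s, a) + \max_{a'} Q_t(s', a') - f(Q_t) - Q_t(s, a) \right) \]

Here \( f(Q_t) \) is a reference function (e.g., \( Q_t(s_0, a_0) \), or an average of \( Q_t \) over a fixed reference set of state-action pairs) whose value converges to the optimal average reward; subtracting it keeps the Q-values bounded without any discounting.

A hypothetical sketch of how such an undiscounted soft critic target might look in a SAC-style implementation follows; all names (next_q, f_offset, etc.) are illustrative assumptions, not the authors' code:

    import torch

    def rvi_soft_critic_target(reward, next_q, next_log_prob, f_offset, alpha):
        """Undiscounted (average reward) soft critic target:
        y = r - f(Q) + Q_target(s', a') - alpha * log pi(a'|s').
        f_offset is a scalar estimate of the average reward, e.g. Q averaged
        over a fixed reference batch, as in RVI Q-learning."""
        with torch.no_grad():
            return reward - f_offset + next_q - alpha * next_log_prob

Compared to the standard SAC target r + gamma * (Q_target - alpha * log pi), the discount factor is gone; its role in keeping the target bounded is taken over by the subtraction of f(Q), which acts as the average reward estimate.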
