RVI-SAC: Average Reward Off-Policy Deep Reinforcement Learning

Yukinari Hisaki, Isao Ono

arXiv.org Artificial Intelligence 

In this paper, we propose an off-policy deep reinforcement learning (DRL) method utilizing the average reward criterion. While most existing DRL methods employ the discounted reward criterion, this can potentially lead to a discrepancy between the training objective and performance metrics in continuing tasks, making the average reward criterion a recommended alternative. We introduce RVI-SAC, an extension of the state-of-the-art off-policy DRL method, Soft Actor-Critic (SAC) (Haarnoja et al., 2018a;b), to the average reward criterion. Our proposal consists of (1) Critic updates based on RVI Q-learning (Abounadi et al., 2001), (2) Actor updates introduced by the average reward soft policy improvement theorem, and (3) automatic adjustment of Reset Cost, enabling the application of average reward reinforcement learning to tasks with termination.

Most existing DRL methods utilize the discounted reward criterion, which is applicable to a variety of MDP-formulated tasks (Puterman, 1994). In particular, for continuing tasks where there is no natural breakpoint in episodes, such as robot locomotion (Todorov et al., 2012) or Access Control Queuing Tasks (Sutton & Barto, 2018), where the interaction between an agent and an environment can continue indefinitely, the discount rate plays a role in keeping the infinite-horizon return bounded. However, discounting introduces an undesirable effect in continuing tasks by prioritizing rewards closer to the current time over those in the future. An approach to mitigate this effect is to bring the discount rate closer to 1, but it is commonly known that a large discount rate can lead to instability and slower convergence (Fujimoto et al., 2018; Dewanto & Gallagher, 2021).
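For reference, here is a minimal sketch of the two criteria discussed above and of the tabular RVI Q-learning update that Critic update (1) builds on, in their standard forms from the cited literature (Puterman, 1994; Abounadi et al., 2001); the entropy-augmented variants actually used in RVI-SAC may differ in detail:

\[ J_\gamma(\pi) = \mathbb{E}_\pi\left[ \sum_{t=0}^{\infty} \gamma^t r_t \right] \quad \text{(discounted criterion)}, \qquad \rho(\pi) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\pi\left[ \sum_{t=0}^{T-1} r_t \right] \quad \text{(average reward criterion)} \]

\[ Q_{t+1}(s, a) \leftarrow Q_t(s, a) + \alpha_t \left( r(s, a) + \max_{a'} Q_t(s', a') - f(Q_t) - Q_t(s, a) \right) \]

Here \( f(Q_t) \) is a reference function (e.g., \( Q_t(s_0, a_0) \), or an average of \( Q_t \) over a fixed reference set of state-action pairs) whose value converges to the optimal average reward; subtracting it keeps the Q-values bounded without any discounting.

A hypothetical sketch of how such an undiscounted soft critic target might look in a SAC-style implementation follows; all names (next_q, f_offset, etc.) are illustrative assumptions, not the authors' code:

    import torch

    def rvi_soft_critic_target(reward, next_q, next_log_prob, f_offset, alpha):
        """Undiscounted (average reward) soft critic target:
        y = r - f(Q) + Q_target(s', a') - alpha * log pi(a'|s').
        f_offset is a scalar estimate of the average reward, e.g. Q averaged
        over a fixed reference batch, as in RVI Q-learning."""
        with torch.no_grad():
            return reward - f_offset + next_q - alpha * next_log_prob

Compared to the standard SAC target r + gamma * (Q_target - alpha * log pi), the discount factor is gone; its role in keeping the target bounded is taken over by the subtraction of f(Q), which acts as the average reward estimate.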
