Addressing Value Estimation Errors in Reinforcement Learning with a State-Action Return Distribution Function

Duan, Jingliang, Guan, Yang, Ren, Yangang, Li, Shengbo Eben, Cheng, Bo

Jan-8-2020–arXiv.org Artificial Intelligence

In current reinforcement learning (RL) methods, function approximation errors are known to cause overestimation or underestimation of the state-action value Q, which in turn leads to suboptimal policies. We show that learning a state-action return distribution function can improve the estimation accuracy of the Q-value. We embed the distributional return function in the maximum entropy RL framework to develop the Distributional Soft Actor-Critic algorithm (DSAC), an off-policy method for continuous control settings. Unlike traditional distributional Q algorithms, which typically learn only a discrete return distribution, DSAC directly learns a continuous return distribution, truncating the difference between the target and current return distributions to prevent gradient explosion. Additionally, we propose a new Parallel Asynchronous Buffer-Actor-Learner architecture (PABAL) to improve learning efficiency. We evaluate our method on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.
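The truncation idea mentioned in the abstract can be illustrated with a small sketch. It is not taken from the paper: it assumes the return distribution is a Gaussian with mean `q_mean` and standard deviation `q_std`, that the critic is trained by minimizing the negative log-likelihood of a sampled target return, and that the truncation is a simple clamp with a hypothetical hyperparameter `bound`. All function names are illustrative.

```python
def clipped_target(target, q_mean, bound):
    """Clamp the target so that |target - q_mean| <= bound (assumed truncation rule)."""
    diff = max(-bound, min(bound, target - q_mean))
    return q_mean + diff


def gaussian_nll_grads(target, q_mean, q_std):
    """Gradients of -log N(target | q_mean, q_std^2) w.r.t. the mean and the std."""
    diff = target - q_mean
    grad_mean = -diff / q_std ** 2
    # The squared difference divided by std^3 is the term that can blow up.
    grad_std = 1.0 / q_std - diff ** 2 / q_std ** 3
    return grad_mean, grad_std


def truncated_grads(target, q_mean, q_std, bound):
    """Same gradients, computed against the clamped target."""
    return gaussian_nll_grads(clipped_target(target, q_mean, bound), q_mean, q_std)


if __name__ == "__main__":
    # An outlier target far from the current mean estimate.
    print(gaussian_nll_grads(100.0, 0.0, 1.0))     # untruncated gradients
    print(truncated_grads(100.0, 0.0, 1.0, 10.0))  # gradients after truncation
```

Under these assumptions, the gradient with respect to the standard deviation grows with the square of the target difference, so clamping that difference bounds the update size while leaving targets already within `bound` of the mean unchanged.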

algorithm, reinforcement learning, return distribution

Country:
- North America > United States
  - Virginia > Arlington County > Arlington (0.04)
- Europe
  - Spain > Canary Islands (0.04)
  - Sweden > Stockholm
    - Stockholm (0.04)
- Asia
  - Middle East > Jordan (0.04)
  - China > Beijing
    - Beijing (0.04)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Machine Learning > Reinforcement Learning (1.00)
