On the Convergence of Discounted Policy Gradient Methods
Policy gradient methods are a class of reinforcement learning (RL) algorithms that attempt to directly maximize the expected performance of an agent's policy by following the gradient of an objective function (Sutton et al., 2000), typically the expected sum of rewards, using a stochastic estimator generated by interacting with the environment. Unbiased estimators of this gradient can suffer from high variance, owing to the variance of the sum of future rewards. A common approach is to instead consider an exponentially discounted sum of future rewards. This approach reduces the variance of most estimators but introduces bias (Thomas, 2014). Frequently, the discounted sum of future rewards is estimated by a critic (Konda and Tsitsiklis, 2000). It has been argued that when a critic is used, discounting has the additional benefit of reducing approximation error (Zhang et al., 2020). The "discounted" policy gradient was originally introduced as the gradient of a discounted objective (Sutton et al., 2000). However, it has been shown that the gradient of the discounted objective does not produce the update direction followed by most discounted policy gradient algorithms (Thomas, 2014; Nota and Thomas, 2019).
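
To make the final point concrete, the following is a minimal single-trajectory sketch in Python (the function and variable names are illustrative, not from the paper; grad_log_probs is assumed to hold the score vectors grad_theta log pi(a_t | s_t) for one episode). The gradient of the discounted objective weights each step's term by an additional gamma^t factor, whereas the update computed by most discounted implementations drops that factor, which is why the two directions differ:

    import numpy as np

    def discounted_returns(rewards, gamma):
        # G_t = sum_{k >= t} gamma^(k - t) * r_k, computed backwards in one pass.
        G = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            G[t] = running
        return G

    def true_discounted_gradient(grad_log_probs, rewards, gamma):
        # Estimator of the gradient of the discounted objective
        # E[sum_t gamma^t r_t] (Sutton et al., 2000): each step's term
        # carries an extra gamma^t weight.
        G = discounted_returns(rewards, gamma)
        return sum((gamma ** t) * G[t] * g for t, g in enumerate(grad_log_probs))

    def common_discounted_update(grad_log_probs, rewards, gamma):
        # Update direction followed by most "discounted" policy gradient
        # implementations: the gamma^t weight is dropped, so this is not
        # the gradient of the discounted objective (Thomas, 2014;
        # Nota and Thomas, 2019).
        G = discounted_returns(rewards, gamma)
        return sum(G[t] * g for t, g in enumerate(grad_log_probs))

For gamma = 1 the two directions coincide; for gamma < 1 they generally differ, which is the distinction drawn by Thomas (2014) and Nota and Thomas (2019).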
arXiv.org Artificial Intelligence
Jan-9-2023