Policy Gradient using Weak Derivatives for Reinforcement Learning
Sujay Bhatt, Alec Koppel, Vikram Krishnamurthy
This paper considers policy search in continuous state-action reinforcement learning problems. Typically, one computes search directions using a classic expression for the policy gradient called the Policy Gradient Theorem, which decomposes the gradient of the value function into two factors: the score function and the Q-function. This paper presents four results: (i) an alternative policy gradient theorem is established that uses weak (measure-valued) derivatives instead of the score function; (ii) the stochastic gradient estimates thus derived are shown to be unbiased and to yield algorithms that converge almost surely to stationary points of the non-convex value function of the reinforcement learning problem; (iii) the sample complexity of the algorithm is derived and shown to be $O(1/\sqrt{k})$; (iv) finally, the expected variance of the gradient estimates obtained using weak derivatives is shown to be lower than that of estimates obtained using the popular score-function approach. Experiments on the OpenAI Gym pendulum environment show superior performance of the proposed algorithm.
Apr-9-2020
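The contrast between the two gradient estimators named in the abstract can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it assumes a one-dimensional Gaussian $N(\theta, \sigma^2)$ in place of a policy, a fixed test function $f(x) = x^2$ in place of the Q-function, and hypothetical function names. It relies on the standard measure-valued derivative of a Gaussian with respect to its mean, $\frac{1}{\sigma\sqrt{2\pi}}(\mu^{+} - \mu^{-})$, where $\mu^{\pm}$ is the law of $\theta \pm \sigma W$ with $W$ Rayleigh-distributed with unit scale. Using the same draw of $W$ for both branches is a coupling choice made here for illustration.

```python
import numpy as np


def score_function_samples(f, theta, sigma, n, rng):
    """Per-sample score-function estimates of d/dtheta E[f(X)], X ~ N(theta, sigma^2)."""
    x = theta + sigma * rng.standard_normal(n)
    score = (x - theta) / sigma**2  # d/dtheta log N(x; theta, sigma^2)
    return f(x) * score


def weak_derivative_samples(f, theta, sigma, n, rng):
    """Per-sample weak (measure-valued) derivative estimates of the same gradient.

    Uses d/dtheta N(theta, sigma^2) = c * (mu_plus - mu_minus), where
    mu_plus / mu_minus are the laws of theta +/- sigma * W, W ~ Rayleigh(1),
    and c = 1 / (sigma * sqrt(2 * pi)).
    """
    w = rng.rayleigh(scale=1.0, size=n)  # same W for both branches (assumed coupling)
    c = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    return c * (f(theta + sigma * w) - f(theta - sigma * w))


def compare(theta=1.0, sigma=1.0, n=200_000, seed=0):
    """Return empirical mean and variance of both estimators for f(x) = x^2."""
    rng = np.random.default_rng(seed)

    def f(x):
        # Toy stand-in for the Q-function; E[f(X)] = theta^2 + sigma^2.
        return x**2

    sf = score_function_samples(f, theta, sigma, n, rng)
    wd = weak_derivative_samples(f, theta, sigma, n, rng)
    return {
        "true_gradient": 2.0 * theta,  # d/dtheta (theta^2 + sigma^2)
        "sf_mean": float(sf.mean()),
        "sf_var": float(sf.var()),
        "wd_mean": float(wd.mean()),
        "wd_var": float(wd.var()),
    }


if __name__ == "__main__":
    for name, value in compare().items():
        print(f"{name}: {value:.4f}")
```

In this toy setting both sample means should land near the true gradient $2\theta$, while the per-sample variance of the weak-derivative estimate comes out smaller than that of the score-function estimate. This agrees with the direction of result (iv), but a single test function proves nothing about the general case.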