Equivalence of stochastic and deterministic policy gradients

Todorov, Emo

arXiv.org (Artificial Intelligence)

Policy gradient methods [1, 3, 4, 6, 7] are now widely used and have produced impressive empirical results. In continuous control, they have been derived for both stochastic [3] and deterministic [7] policies. The resulting algorithms have different strengths and weaknesses, even though there are a number of similarities in how they are applied in practice, and recent generalizations [10, 13] have pointed to even deeper similarities. Here we study the relationship between the two. First we focus on MDPs with Gaussian control noise and quadratic control cost (while the dynamics and state cost remain general) and show that deterministic and stochastic policy gradients for such MDPs are equivalent. We then develop a much more general result: any MDP with a stochastic policy can be converted into an equivalent MDP with a deterministic policy. The new MDP has the same states and policy parameters but a different control space, namely the sufficient statistics of the stochastic policy in the original MDP. The only quantities that are not equivalent (in either the special or the general case) are the state-control value functions. This observation suggests that policy gradient methods can be unified by approximating state value functions, rather than the common practice of approximating state-control value functions.
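For background, the two gradient theorems being related can be stated as follows. These are the standard forms from [3] and [7], restated here for orientation rather than taken from the present paper. With discounted state distribution $\rho$, stochastic policy $\pi_\theta(u|s)$, and deterministic policy $\mu_\theta(s)$:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \rho^{\pi},\, u \sim \pi_\theta(\cdot|s)}\left[ \nabla_\theta \log \pi_\theta(u|s)\, Q^{\pi}(s,u) \right],$$

$$\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\left[ \nabla_\theta \mu_\theta(s)\, \nabla_u Q^{\mu}(s,u)\big|_{u=\mu_\theta(s)} \right].$$

For a Gaussian policy $\pi_\theta(u|s) = \mathcal{N}(u;\, \mu_\theta(s), \Sigma)$ the score function is $\nabla_\theta \log \pi_\theta(u|s) = \nabla_\theta \mu_\theta(s)\, \Sigma^{-1}(u - \mu_\theta(s))$, which is what connects the first expression to the second in the Gaussian-noise, quadratic-control-cost special case.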
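The special-case equivalence is easy to check numerically. The sketch below is my own construction for illustration (the one-step setup and all variable names are assumptions, not code from the paper): it compares the Monte Carlo score-function gradient of a Gaussian policy against the analytic gradient of the equivalent deterministic objective over the policy mean, for a single step with reward quadratic in the control.

    # Sketch: one-step problem, Gaussian policy u ~ N(theta, sigma^2 I),
    # quadratic reward R(u) = -0.5 u^T A u + b^T u. The score-function
    # (stochastic) gradient should match the gradient of the equivalent
    # deterministic objective over the mean theta.
    import numpy as np

    rng = np.random.default_rng(0)

    A = np.array([[2.0, 0.3], [0.3, 1.0]])   # symmetric cost matrix (assumed)
    b = np.array([0.5, -1.0])
    sigma = 0.4                               # fixed exploration noise
    theta = np.array([0.2, -0.3])             # policy parameters = Gaussian mean

    # Stochastic policy gradient (score-function / REINFORCE estimator):
    # grad J = E[ R(u) * grad_theta log N(u; theta, sigma^2 I) ]
    n = 1_000_000
    u = theta + sigma * rng.standard_normal((n, 2))
    R = -0.5 * np.einsum('ni,ij,nj->n', u, A, u) + u @ b
    score = (u - theta) / sigma**2
    grad_stoch = (R[:, None] * score).mean(axis=0)

    # Equivalent deterministic objective over the mean:
    # E[R(u)] = R(theta) - 0.5 sigma^2 tr(A), hence grad = -A theta + b.
    grad_det = -A @ theta + b

    print(grad_stoch)   # Monte Carlo estimate; agrees with grad_det to ~1e-2
    print(grad_det)     # analytic: [0.19, -0.76]

The noise term only shifts the deterministic objective by the constant $-\tfrac{1}{2}\sigma^2\,\mathrm{tr}(A)$, so both gradients coincide exactly in expectation; this is the mechanism behind the Gaussian/quadratic equivalence, shown here in its simplest one-step form.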
