Cai, Qingpeng, Pan, Ling, Tang, Pingzhong

We study a setting of reinforcement learning, where the state transition is a convex combination of a stochastic continuous function and a deterministic discontinuous function. Such a setting include as a special case the stochastic state transition setting, namely the setting of deterministic policy gradient (DPG). We introduce a theoretical technique to prove the existence of the policy gradient in this generalized setting. Using this technique, we prove that the deterministic policy gradient indeed exists for a certain set of discount factors, and further prove two conditions that guarantee the existence for all discount factors. We then derive a closed form of the policy gradient whenever exists. Interestingly, the form of the policy gradient in such setting is equivalent to that in DPG. Furthermore, to overcome the challenge of high sample complexity of DPG in this setting, we propose the Generalized Deterministic Policy Gradient (GDPG) algorithm. The main innovation of the algorithm is to optimize a weighted objective of the original Markov decision process (MDP) and an augmented MDP that simplifies the original one, and serves as its lower bound. To solve the augmented MDP, we make use of the model-based methods which enable fast convergence. We finally conduct extensive experiments comparing GDPG with state-of-the-art methods on several standard benchmarks. Results demonstrate that GDPG substantially outperforms other baselines in terms of both convergence and long-term rewards.

Ciosek, Kamil, Whiteson, Shimon

We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by expected sarsa, EPG integrates across the action when estimating the gradient, instead of relying only on the action in the sampled trajectory. We establish a new general policy gradient theorem, of which the stochastic and deterministic policy gradient theorems are special cases. We also prove that EPG reduces the variance of the gradient estimates without requiring deterministic policies and, for the Gaussian case, with no computational overhead. Finally, we show that it is optimal in a certain sense to explore with a Gaussian policy such that the covariance is proportional to the exponential of the scaled Hessian of the critic with respect to the actions. We present empirical results confirming that this new form of exploration substantially outperforms DPG with the Ornstein-Uhlenbeck heuristic in four challenging MuJoCo domains.

Ciosek, Kamil (University of Oxford) | Whiteson, Shimon (University of Oxford)

Off-policy stochastic actor-critic methods rely on approximating the stochastic policy gradient in order to derive an optimal policy. One may also derive the optimal policy by approximating the action-value gradient. The use of action-value gradients is desirable as policy improvement occurs along the direction of steepest ascent. This has been studied extensively within the context of natural gradient actor-critic algorithms and more recently within the context of deterministic policy gradients. In this paper we briefly discuss the off-policy stochastic counterpart to deterministic action-value gradients, as well as an incremental approach for following the policy gradient in lieu of the natural gradient.