Cai, Qingpeng, Pan, Ling, Tang, Pingzhong

We study a setting of reinforcement learning, where the state transition is a convex combination of a stochastic continuous function and a deterministic discontinuous function. Such a setting include as a special case the stochastic state transition setting, namely the setting of deterministic policy gradient (DPG). We introduce a theoretical technique to prove the existence of the policy gradient in this generalized setting. Using this technique, we prove that the deterministic policy gradient indeed exists for a certain set of discount factors, and further prove two conditions that guarantee the existence for all discount factors. We then derive a closed form of the policy gradient whenever exists. Interestingly, the form of the policy gradient in such setting is equivalent to that in DPG. Furthermore, to overcome the challenge of high sample complexity of DPG in this setting, we propose the Generalized Deterministic Policy Gradient (GDPG) algorithm. The main innovation of the algorithm is to optimize a weighted objective of the original Markov decision process (MDP) and an augmented MDP that simplifies the original one, and serves as its lower bound. To solve the augmented MDP, we make use of the model-based methods which enable fast convergence. We finally conduct extensive experiments comparing GDPG with state-of-the-art methods on several standard benchmarks. Results demonstrate that GDPG substantially outperforms other baselines in terms of both convergence and long-term rewards.

Ciosek, Kamil, Whiteson, Shimon

We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by expected sarsa, EPG integrates across the action when estimating the gradient, instead of relying only on the action in the sampled trajectory. We establish a new general policy gradient theorem, of which the stochastic and deterministic policy gradient theorems are special cases. We also prove that EPG reduces the variance of the gradient estimates without requiring deterministic policies and, for the Gaussian case, with no computational overhead. Finally, we show that it is optimal in a certain sense to explore with a Gaussian policy such that the covariance is proportional to the exponential of the scaled Hessian of the critic with respect to the actions. We present empirical results confirming that this new form of exploration substantially outperforms DPG with the Ornstein-Uhlenbeck heuristic in four challenging MuJoCo domains.

Ciosek, Kamil (University of Oxford) | Whiteson, Shimon (University of Oxford)

In the previous blog post Using Keras and Deep Q-Network to Play FlappyBird we demonstrate using Deep Q-Network to play FlappyBird. However, a big limitation of Deep Q-Network is that the outputs/actions are discrete while the action like steering are continuous in car racing. An obvious approach to adapt DQN to continuous domains is to simply discretize the action space. However, we encounter the "curse of dimensionality" problem. For example, if you discretize the steering wheel from -90 to 90 degrees in 5 degrees each and acceleration from 0km to 300km in 5km each, your output combinations will be 36 steering states times 60 velocity states which equals to 2160 possible combinations. The situation will be worse when you want to build robots to perform something very specialized, such as brain surgery that requires fine control of actions and naive discretization will not able to achieve the required precision to do the operations.