Reviews: Distributional Policy Optimization: An Alternative Approach for Continuous Control

Neural Information Processing Systems 

This paper proposes a distributional policy optimization (DPO) framework and a practical implementation of it, generative actor-critic (GAC), which belongs to the family of off-policy actor-critic methods. Policy gradient methods, which currently dominate continuous control, are prone to getting stuck in local optima, so a method that addresses this problem at a fundamental level is valuable. Overall, the paper is well written, and the proposed algorithm appears novel and sound. Does it stand for 'every' state-action pair and state, or only for the state-action pairs visited by the current policy \pi_k? If the latter, it seems that DPO might not converge to the global optimum.