Learning Continuous Control Policies by Stochastic Value Gradients

Nicolas Heess, Greg Wayne