Learning Stochastic Optimal Policies via Gradient Descent