Policy Gradient With Serial Markov Chain Reasoning