Policy Gradient With Serial Markov Chain Reasoning

Neural Information Processing Systems

We introduce a new framework that performs decision-making in reinforcement learning (RL) as an iterative reasoning process.