Towards Principled, Practical Policy Gradient for Bandits and Tabular MDPs