Batch Reinforcement Learning with a Nonparametric Off-Policy Policy Gradient
