Why Policy Gradient Algorithms Work for Undiscounted Total-Reward MDPs