Why Policy Gradient Algorithms Work for Undiscounted Total-Reward MDPs