convergence of several policy gradient methods, whose novelty is summarized in Lines 210-212 and further explained