Understanding the Effect of Stochasticity in Policy Optimization
– Neural Information Processing Systems
Until recently, it had generally been assumed that methods based on following the policy gradient (PG) [1] could not be guaranteed to converge to globally optimal solutions, given that the policy value function is not concave.
Feb-10-2026, 08:44:23 GMT