POMO: Policy Optimization with Multiple Optima for Reinforcement Learning

Neural Information Processing Systems 

Empirically, POMO's low-variance baseline makes RL training fast and stable, and makes training more resistant to local minima than previous approaches.
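
As a rough illustration of what such a baseline can look like, the sketch below assumes the baseline is the average reward over several rollouts sampled for the same problem instance, with each rollout's advantage measured against that average. The function name `shared_baseline_advantages` and the reward values are hypothetical and chosen only for illustration; this is a minimal sketch, not the reference implementation.

```python
from statistics import mean


def shared_baseline_advantages(rewards):
    """Return per-rollout advantages against a shared mean-reward baseline.

    `rewards` holds one scalar reward per rollout of the SAME problem
    instance (assumption: e.g. negative tour length for a routing problem).
    """
    baseline = mean(rewards)  # one baseline shared by all rollouts
    return [r - baseline for r in rewards]


# Hypothetical rewards from four rollouts of a single instance.
rewards = [-10.0, -12.0, -11.0, -9.0]
advantages = shared_baseline_advantages(rewards)

# Rollouts better than the average get a positive advantage, worse ones a
# negative advantage, and the advantages sum to zero by construction.
for r, a in zip(rewards, advantages):
    print(f"reward={r:6.1f}  advantage={a:+.1f}")
```

Because the baseline is computed from rollouts of the same instance, it tracks instance difficulty without a separate learned critic, which is one plausible reason for the lower variance reported above.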