Review for NeurIPS paper: First Order Constrained Optimization in Policy Space

Neural Information Processing Systems 

Strengths: The idea behind this work is very clear. Theorem 1 can be easily derived using duality, but the difficulty lies in how to solve it. The paper is creative in its simplification on the update of dual varaibles lambda and mu. Without this, the algorithm would be pretty much a primal-dual gradient method, which requires the primal part to be evaluated accurately enough before the update of dual variables, which is difficult in practice. For lambda, the paper takes advantage of its similarity to the temperature term in maximum entropy RL and thus fixes it during training.