Reviews: Neural Proximal/Trust Region Policy Optimization Attains Globally Optimal Policy
–Neural Information Processing Systems
Originality: The authors apply the idea that overparametrization induces local linearization, which has been documented for supervised learning, and in another submission for TD learning. In particular, they decompose the error into two terms, one due to TD, and the other due to SGD, and incorporate them in the analysis of infinite-dimensional mirror descent. The insight that the previous previous analysis for TD could be generalised to a meta algorithm that includes both TD and SGD as particular cases is key. Related work is adequately cited, and differences with previous works are clearly stated, including differences with the sister submission [5]. Quality: The submission seems technically sound, and includes detailed proofs (I just skimmed through them). This is a complete piece of work.
Neural Information Processing Systems
Jan-22-2025, 07:33:15 GMT
- Technology: