Neural Information Processing Systems 

We thank the reviewers for their constructive comments and address the main concerns below. In our implementation, it was crucial to use the improvements from Sec. 3.4. We ran the "positive response" version of ApproPO (Algorithm 5) for 2000 outer-loop iterations (i.e., 2000 updates of λ), but needed to make at most 61 RL [...]. Note that the policy mixture returned by ApproPO is just a weighted combination of the cached policies. We will add this discussion to the paper and also update the plots so that they are reported in terms of transitions rather than trajectories.
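To illustrate the remark that the returned mixture is a weighted combination of cached policies, here is a minimal Python sketch. It assumes, purely for illustration, that each outer-loop iteration records which cached policy it used and that a policy's weight is the fraction of iterations that used it. The names `mixture_weights`, `sample_policy`, and the policy ids `pi_0`, `pi_1`, `pi_2` are hypothetical and do not come from our implementation.

```python
import random
from collections import Counter


def mixture_weights(selected):
    """Map each cached policy id to the fraction of iterations that used it."""
    counts = Counter(selected)
    total = len(selected)
    return {policy_id: n / total for policy_id, n in counts.items()}


def sample_policy(weights, rng):
    """Draw one cached policy id with probability equal to its mixture weight."""
    ids = list(weights)
    return rng.choices(ids, weights=[weights[i] for i in ids], k=1)[0]


# Hypothetical run: 10 outer-loop iterations that reuse only 3 cached policies.
selected = ["pi_0", "pi_1", "pi_1", "pi_2", "pi_1",
            "pi_0", "pi_2", "pi_1", "pi_1", "pi_0"]
weights = mixture_weights(selected)

# Executing the mixture: pick a cached policy at the start of an episode.
chosen = sample_policy(weights, random.Random(0))
```

Under this assumption the number of distinct policies in the mixture is bounded by the cache size, not by the number of outer-loop iterations.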