R1/R3: Running time and practicality of ApproPO: In our experiments, we implement an RL oracle by a policy-2

Neural Information Processing Systems 

We thank the reviewers for their constructive comments. We address the main concerns below. In our implementation, it was crucial to use the improvements from Sec. 3.4. We ran the "positive response" version of [...] Note that the policy mixture returned by ApproPO is just a weighted combination of the policies from the cache. We will add this discussion to the paper and also update the plots so that they are in terms of transitions rather than trajectories.
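To illustrate the remark that the returned mixture is just a weighted combination of cached policies, here is a minimal sketch of how such a mixture might be executed. It is not our actual implementation. The names `MixturePolicy`, `reset`, and `act` are hypothetical, policies are assumed to be plain callables from a state to an action, and we assume one cached policy is drawn per episode according to the weights.

```python
import random
from typing import Callable, List, Optional, Sequence

# Hypothetical policy type: maps an integer state id to an integer action id.
Policy = Callable[[int], int]


class MixturePolicy:
    """Weighted combination of cached policies (illustrative sketch only)."""

    def __init__(self, policies: Sequence[Policy], weights: Sequence[float],
                 seed: int = 0) -> None:
        if not policies or len(policies) != len(weights):
            raise ValueError("need one weight per cached policy")
        if any(w < 0 for w in weights) or sum(weights) <= 0:
            raise ValueError("weights must be non-negative with positive sum")
        total = float(sum(weights))
        self.policies: List[Policy] = list(policies)
        # Normalize so the weights form a probability distribution.
        self.weights: List[float] = [w / total for w in weights]
        self._rng = random.Random(seed)
        self._active: Optional[Policy] = None

    def reset(self) -> None:
        # Assumption: one cached policy is drawn at the start of each episode.
        self._active = self._rng.choices(self.policies,
                                         weights=self.weights, k=1)[0]

    def act(self, state: int) -> int:
        # Follow the policy drawn for the current episode.
        if self._active is None:
            self.reset()
        return self._active(state)
```

Under these assumptions, running the mixture costs no more per step than running a single cached policy, since the only extra work is one draw per episode.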
