R1/R3: Running time and practicality of ApproPO: In our experiments, we implement an RL oracle by a policy-2
Neural Information Processing Systems
We thank the reviewers for their constructive comments. We address the main concerns below. In our implementation, it was crucial to use the improvements from Sec. 3.4. We ran the "positive response" version of [...] Note that the policy mixture returned by ApproPO is just a weighted combination of the policies from the cache. We will add this discussion to the paper and also update the plots so that they are in terms of transitions rather than trajectories.
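To illustrate why a weighted combination of cached policies is cheap to deploy, here is a minimal sketch. It assumes that the mixture is executed by drawing one cached policy per episode with probability proportional to its weight; the names `Policy` and `sample_policy` and the state/action types are hypothetical and are not taken from our implementation.

```python
import random
from typing import Callable, Sequence

# Hypothetical policy type: maps a state (int) to an action (int).
Policy = Callable[[int], int]


def sample_policy(
    policies: Sequence[Policy],
    weights: Sequence[float],
    rng: random.Random,
) -> Policy:
    """Draw one cached policy with probability proportional to its weight.

    Assumption: the drawn policy is then followed for a whole episode.
    """
    if not policies or len(policies) != len(weights):
        raise ValueError("need one weight per cached policy")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    # random.Random.choices normalizes the weights internally.
    return rng.choices(policies, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Two toy cached policies that always take a fixed action.
    cache = [lambda s: 0, lambda s: 1]
    mixture_weights = [0.25, 0.75]
    policy = sample_policy(cache, mixture_weights, rng)
    print("action in state 0:", policy(0))
```

Under this assumption, running the mixture costs no more than running a single cached policy plus one weighted draw per episode.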
Oct-3-2025, 03:53:02 GMT