Appendix A: Reminders about integral probability metrics. Let ...
–Neural Information Processing Systems
In the context of Section 4.1, we have (at least) the following instantiations of Assumption 4.2: (i) Assume the reward is bounded by r ...
We provide a proof for Lemma 4.1 for completeness.
Now we prove Theorem 4.2. We first note that a two-sided bound follows from Lemma 4.1: | η ...
We outline the practical MOPO algorithm in Algorithm 2.
To answer question (3), we conduct a thorough ablation study on MOPO. The main goal of the ablation study is to understand how the choice of reward penalty affects performance.
Require: reward penalty coefficient λ, rollout horizon h, rollout batch size b.
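To make the role of the reward penalty coefficient λ concrete, the following is a minimal sketch, assuming the penalized reward takes the form r − λ·u(s, a), where u is some uncertainty estimate of the learned dynamics. The function name `penalized_reward` and the particular estimator used here (the largest L2 norm of the predicted standard deviation across ensemble members) are illustrative assumptions, not details recoverable from the excerpt above.

```python
import math


def penalized_reward(reward, ensemble_stds, lam):
    """Return reward - lam * u for a single (state, action) pair.

    ensemble_stds: one list of predicted per-dimension standard deviations
    per ensemble member (hypothetical input format).
    u is taken as the largest L2 norm among those vectors; this choice of
    uncertainty estimator is an assumption made for illustration.
    """
    u = max(math.sqrt(sum(s * s for s in stds)) for stds in ensemble_stds)
    return reward - lam * u


# Two ensemble members; the first has the larger std norm (5.0).
example = penalized_reward(1.0, [[3.0, 4.0], [0.0, 1.0]], lam=0.5)
```

With λ = 0 the reward is left unchanged, and larger λ makes rollouts through uncertain regions of the model less attractive. An ablation over the reward penalty could swap in a different `u` while keeping the rest fixed.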
Oct-3-2025, 18:17:40 GMT