Appendix A Reminders about integral probability metrics Let

Neural Information Processing Systems 

In the context of Section 4.1, we have (at least) the following instantiations of Assumption 4.2: (i) Assume the reward is bounded by r

We provide a proof for Lemma 4.1 for completeness.

Now we prove Theorem 4.2. We first note that a two-sided bound follows from Lemma 4.1: | η

We outline the practical MOPO algorithm in Algorithm 2.

To answer question (3), we conduct a thorough ablation study on MOPO. The main goal of the ablation study is to understand how the choice of reward penalty affects performance.

Require: reward penalty coefficient λ, rollout horizon h, rollout batch size b.
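To make the role of the reward penalty coefficient λ concrete, the following is a minimal sketch, not the authors' implementation. It assumes the penalty takes the form of a per-transition uncertainty estimate scaled by λ and subtracted from the model-predicted reward; the function name `penalized_reward` and the sample arrays are hypothetical and chosen only for illustration.

```python
import numpy as np


def penalized_reward(reward, uncertainty, lam):
    """Return reward - lam * uncertainty, elementwise.

    Assumption: `uncertainty` is some nonnegative per-transition estimate
    of model error; how it is computed is left open in this sketch.
    """
    reward = np.asarray(reward, dtype=float)
    uncertainty = np.asarray(uncertainty, dtype=float)
    return reward - lam * uncertainty


# Hypothetical batch of model-predicted rewards and uncertainty estimates.
rewards = np.array([1.0, 0.5, 2.0])
uncert = np.array([0.0, 0.5, 1.0])

penalized = penalized_reward(rewards, uncert, lam=1.0)
unpenalized = penalized_reward(rewards, uncert, lam=0.0)
```

Under this assumed form, setting λ to zero recovers the unpenalized reward, while larger values of λ discount transitions with high estimated uncertainty more heavily.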
