Appendix A Additional table

Table 2 presents the numerical results for the ablation study in Section 4.2.

Neural Information Processing Systems 

The results of our main method in Section 4.1 are reported in column Main. Test denotes the variant that uses the estimated reward function as the test function when training the MIW ω. This may be related to the unstable estimation of the KL-dual discussed in Section 3.2. Removing rollout data from policy learning generally leads to worse performance and larger standard deviations. From Eq. (22), the MIW ω can be optimized via two alternative approaches. (1) We can
