Appendix A Additional table

Table 2 presents the numerical results for the ablation study in Section 4.2.
–Neural Information Processing Systems
The results of our main method in Section 4.1 are reported in column Main. Test denotes the variant that uses the estimated reward function as the test function when training the MIW ω. This may be related to the unstable estimation of the KL-dual discussed in Section 3.2. Removing rollout data from policy learning generally leads to worse performance and larger standard deviations. From Eq. (22), the MIW ω can be optimized via two alternative approaches. (1) We can
Feb-9-2026, 16:16:12 GMT