Optimization
Appendix A Additional tables
Table 2 presents the numerical results for the ablation study in Section 4.2. The results of our main method in Section 4.1 are reported in the column Main. Table 3 provides an additional ablation study on several building blocks of our main method. T est denotes the variant of using the estimated reward function as the test function when We see that changing the proposed JSD regularizer in Section 3.2 to the KL-dual-based regularizers Changing the implicit policy to the Gaussian policy generally leads to worse performance. The performance difference is especially significant on the Maze2D and Adroit datasets.