70d31b87bd021441e5e6bf23eb84a306-Supplemental.pdf

Neural Information Processing Systems 

Reshaping the MDP as in (1) preserves the following characteristics: 1) If $h(s) \in [0, \frac{1}{1-\gamma}]$, then $\widetilde{V}^{\pi}(s) \in [0, \frac{1}{1-\gamma}]$ for all $\pi$ and $s \in \mathcal{S}$. 2) If $\widetilde{\mathcal{M}}$ is a linear MDP with feature vector $\phi(s,a)$ (i.e. ...

Lemma B.2 can be obtained by setting $V = f$; Lemma B.1 can be obtained by further setting $\lambda = 0$, so Lemma B.1 is a special case of Lemma B.2, and Lemma A.1 generalizes both.

The best configuration was then used in the following experiments. These results are shown in Figure 1, where HuRL-VAEMC denotes HuRL using this VAE-based heuristic. We used PPO [36] implemented in RLlib (Apache License 2.0) [57] as the base algorithm.
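The first property above (a bounded heuristic yields bounded reshaped values) can be checked numerically on a small tabular MDP. The sketch below is illustrative only and rests on assumptions not stated in this excerpt: equation (1) is not reproduced here, so the code assumes the reshaping takes the form of reshaped reward r(s,a) + (1 − λ) γ E[h(s') | s, a] with reshaped discount λγ, and it assumes rewards lie in [0, 1]. The helper names (`reshape_mdp`, `evaluate_policy`, `random_mdp`) are hypothetical and are not taken from the paper's code.

```python
import random


def reshape_mdp(P, r, h, gamma, lam):
    """Return (r_tilde, gamma_tilde) under the ASSUMED form of Eq. (1):
    r_tilde(s,a) = r(s,a) + (1 - lam) * gamma * E[h(s') | s, a],
    gamma_tilde  = lam * gamma.
    """
    n_s, n_a = len(r), len(r[0])
    r_tilde = [
        [
            r[s][a]
            + (1.0 - lam) * gamma * sum(P[s][a][s2] * h[s2] for s2 in range(n_s))
            for a in range(n_a)
        ]
        for s in range(n_s)
    ]
    return r_tilde, lam * gamma


def evaluate_policy(P, r, gamma, policy, iters=2000):
    """Iterative policy evaluation for a deterministic policy (list of actions)."""
    n_s = len(r)
    V = [0.0] * n_s
    for _ in range(iters):
        V = [
            r[s][policy[s]]
            + gamma * sum(P[s][policy[s]][s2] * V[s2] for s2 in range(n_s))
            for s in range(n_s)
        ]
    return V


def random_mdp(n_s, n_a, rng):
    """Random transition tensor P[s][a][s'] and rewards r[s][a] in [0, 1)."""
    P = []
    for _ in range(n_s):
        row = []
        for _ in range(n_a):
            w = [rng.random() for _ in range(n_s)]
            total = sum(w)
            row.append([x / total for x in w])
        P.append(row)
    r = [[rng.random() for _ in range(n_a)] for _ in range(n_s)]
    return P, r


rng = random.Random(0)
gamma, lam = 0.9, 0.5
n_states, n_actions = 4, 2
v_max = 1.0 / (1.0 - gamma)

P, r = random_mdp(n_states, n_actions, rng)
# Heuristic values drawn anywhere in [0, 1/(1 - gamma)].
h = [rng.uniform(0.0, v_max) for _ in range(n_states)]
r_tilde, gamma_tilde = reshape_mdp(P, r, h, gamma, lam)

policy = [rng.randrange(n_actions) for _ in range(n_states)]
V_tilde = evaluate_policy(P, r_tilde, gamma_tilde, policy)

# Under the stated assumptions, every reshaped value stays in [0, 1/(1 - gamma)].
print(all(0.0 <= v <= v_max + 1e-9 for v in V_tilde))
```

Under these assumptions the bound follows from a short calculation: the reshaped reward is at most 1 + (1 − λ)γ/(1 − γ) = (1 − λγ)/(1 − γ), and dividing by 1 − λγ gives 1/(1 − γ).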
