step

Neural Information Processing Systems 

As explained in the main text, this section presents an example that is only a slight modification of the one in Figure 4, but where a multi-step approach is clearly preferred over just one step. Essentially, the coverage of the behavior policy in this example reduces the magnitude of the evaluation errors. This shows that the one-step algorithm indeed optimizes a lower bound on the performance difference. We are not familiar with similar guarantees for the iterative or multi-step approaches that rely on off-policy evaluation. We follow the practice of Fu et al. [2020] and tune a small set of hyperparameters by interacting with the simulator to estimate the value of the policies learned under each hyperparameter setting.
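The tuning procedure described in the last sentence can be illustrated with a minimal sketch. It assumes a hypothetical `train` function that maps a hyperparameter setting to a policy and a hypothetical `simulate_episode` function that returns the return of one simulator rollout. Neither name comes from the paper, and the toy bandit-style stand-ins below are placeholders, not the benchmark tasks or hyperparameters actually used.

```python
import random
from statistics import mean


def rollout_return(policy, simulate_episode, n_episodes, seed):
    """Monte Carlo estimate of a policy's value from simulator rollouts."""
    rng = random.Random(seed)
    return mean(simulate_episode(policy, rng) for _ in range(n_episodes))


def select_hyperparameter(candidates, train, simulate_episode,
                          n_episodes=200, seed=0):
    """Train one policy per setting and keep the setting with the best estimate."""
    scores = {}
    for setting in candidates:
        policy = train(setting)  # learn a policy under this setting
        scores[setting] = rollout_return(policy, simulate_episode,
                                         n_episodes, seed)
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical stand-ins, for illustration only.
def toy_train(temperature):
    # The "policy" is just the probability of choosing the rewarding action;
    # a lower temperature gives a greedier policy.
    return 1.0 / (1.0 + temperature)


def toy_episode(p_good, rng):
    # One-step episode: reward 1 if the rewarding action is chosen, else 0.
    return 1.0 if rng.random() < p_good else 0.0


best, scores = select_hyperparameter([0.1, 1.0, 10.0], toy_train, toy_episode)
print(best, scores)
```

Reusing the same seed for every setting means all candidates are compared on the same random draws, which reduces the noise in the comparison. Because selection uses online rollouts, the reported scores depend on simulator access and are not a purely offline model-selection procedure.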
