



24cceab7ffc1118f5daaace13c670885-Supplemental.pdf

Neural Information Processing Systems

A.1 Algorithm

The code is available at https://github.com/mklissa/MOC.

A.2 Tabular experiments

A.2.1 Implementation Details

For our experiments in the FourRooms domain, we based our implementation on [Bacon et al., 2016] and ran the experiments for 500 episodes, each lasting at most 1000 steps, with the goal located in the right hallway. In the first experiment, we test whether our method can accelerate the learning of a fixed set of options. We define this fixed set as the hallway options from Sutton et al. [1999b]. Because the policies of these options are deterministic and we use importance sampling, we relax them to stochastic policies in which the most likely action is the one leading to a hallway.
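The relaxation of deterministic option policies described above can be illustrated with a minimal sketch. The excerpt does not state the exact form of the relaxation, so the epsilon-soft scheme below, the value of `eps`, and the function names `relax_policy` and `importance_ratio` are all assumptions made for illustration, not the authors' implementation. The point it shows is that every action keeps a nonzero probability, which keeps importance sampling ratios finite, while the hallway-directed action stays the most likely one.

```python
import numpy as np


def relax_policy(greedy_actions, n_actions, eps=0.1):
    """Relax a deterministic policy into an epsilon-soft stochastic one.

    greedy_actions: one action index per state (the action the original
    deterministic option policy would take).
    Returns an array of shape (n_states, n_actions) whose rows sum to 1.
    The epsilon-soft form and eps=0.1 are assumptions for illustration.
    """
    greedy_actions = np.asarray(greedy_actions)
    n_states = greedy_actions.shape[0]
    # Spread eps uniformly over all actions so none has zero probability.
    probs = np.full((n_states, n_actions), eps / n_actions)
    # Put the remaining mass on the original deterministic action.
    probs[np.arange(n_states), greedy_actions] += 1.0 - eps
    return probs


def importance_ratio(target_probs, behavior_probs, state, action):
    """Per-step importance sampling ratio pi_target / pi_behavior."""
    return target_probs[state, action] / behavior_probs[state, action]


# Hypothetical example: 3 states, 4 actions.
option_a = relax_policy([0, 2, 2], n_actions=4, eps=0.2)
option_b = relax_policy([1, 2, 3], n_actions=4, eps=0.2)
ratio = importance_ratio(option_a, option_b, state=0, action=1)
```

With a fully deterministic policy the denominator of the ratio could be zero for actions the behavior option never takes; the relaxed policies avoid this at the cost of occasionally taking non-hallway actions.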



SafeDICE: Offline Safe Imitation Learning with Non-Preferred Demonstrations

Neural Information Processing Systems

In this paper, we present a hyperparameter-free offline safe IL algorithm, SafeDICE, that learns a safe policy by leveraging the non-preferred demonstrations in the space of stationary distributions. Our algorithm directly estimates the stationary distribution corrections of the policy that imitates the demonstrations excluding the non-preferred behavior.