The LoCA Regret: A Consistent Metric to Evaluate Model-Based Behavior in Reinforcement Learning -- Supplementary Material -- A Tabular Experiments

Neural Information Processing Systems

For all tabular experiments, we used ε-greedy exploration with ε = 0.1. Furthermore, during pretraining and training, we used a maximum episode-length of 100. For evaluation, we set ε = 0, and ran 10 evaluation episodes. We used a fixed step-size α for all tabular experiments. Therefore, there is stochasticity in the update target even in deterministic environments due to exploration of the behavior policy.
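
The exploration and evaluation settings above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the chain environment, the function names (`epsilon_greedy`, `run_episode`, `evaluate`), and the random tie-breaking among greedy actions are all hypothetical. Only the constants (ε = 0.1 for training, ε = 0 for evaluation, a maximum episode length of 100, and 10 evaluation episodes) come from the text.

```python
import random

# Constants taken from the text.
TRAIN_EPSILON = 0.1
EVAL_EPSILON = 0.0
MAX_EPISODE_LENGTH = 100
NUM_EVAL_EPISODES = 10


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    # Assumption: ties among greedy actions are broken uniformly at random.
    return rng.choice([a for a, q in enumerate(q_values) if q == best])


def run_episode(q_table, epsilon, rng):
    """Run one episode on a hypothetical deterministic chain.

    Action 1 moves right, action 0 moves left. Reaching the last state
    gives reward 1 and ends the episode. Episodes are cut off after
    MAX_EPISODE_LENGTH steps.
    """
    n_states = len(q_table)
    state, episode_return = 0, 0.0
    for _ in range(MAX_EPISODE_LENGTH):
        action = epsilon_greedy(q_table[state], epsilon, rng)
        if action == 1:
            state = min(state + 1, n_states - 1)
        else:
            state = max(state - 1, 0)
        if state == n_states - 1:
            episode_return += 1.0
            break
    return episode_return


def evaluate(q_table, rng):
    """Average return over NUM_EVAL_EPISODES greedy (epsilon = 0) episodes."""
    returns = [run_episode(q_table, EVAL_EPSILON, rng)
               for _ in range(NUM_EVAL_EPISODES)]
    return sum(returns) / len(returns)


if __name__ == "__main__":
    rng = random.Random(0)
    q_table = [[0.0, 1.0] for _ in range(5)]  # toy values that prefer "right"
    print("training-style episode return:", run_episode(q_table, TRAIN_EPSILON, rng))
    print("evaluation return:", evaluate(q_table, rng))
```

With ε = 0.1 the behavior policy occasionally takes a non-greedy action, so the sequence of visited states (and hence any update target built from the next action or state) varies between runs even though the transitions themselves are deterministic. With ε = 0 the evaluation episodes are fully greedy.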