The LoCA Regret: A Consistent Metric to Evaluate Model-Based Behavior in Reinforcement Learning -- Supplementary Material -- A Tabular Experiments

Neural Information Processing Systems 

For all tabular experiments, we used ε-greedy exploration with ε = 0.1. Furthermore, during pretraining and training, we used a maximum episode length of 100. For evaluation, we set ε = 0 and ran 10 evaluation episodes. We used a fixed step-size α for all tabular experiments. Therefore, there is stochasticity in the update target even in deterministic environments due to exploration of the behavior policy.
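
As an illustration of the exploration setup described above, the following is a minimal sketch of ε-greedy action selection over tabular action values. It is not the authors' implementation: the function name `epsilon_greedy`, the constant names, and the random tie-breaking rule are assumptions made for this example. Only the numeric settings (ε = 0.1 for training, ε = 0 for evaluation, episode cap of 100, 10 evaluation episodes) come from the text.

```python
import numpy as np

# Settings taken from the text.
EPSILON_TRAIN = 0.1        # exploration rate during pretraining and training
EPSILON_EVAL = 0.0         # greedy behavior during evaluation
MAX_EPISODE_LENGTH = 100   # episode cap during pretraining and training
NUM_EVAL_EPISODES = 10     # number of evaluation episodes


def epsilon_greedy(q_values, epsilon, rng):
    """Return a uniformly random action with probability epsilon,
    otherwise an action with maximal estimated value.

    q_values: 1-D array of action-value estimates for the current state.
    rng: a numpy.random.Generator.
    """
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    # Random tie-breaking among maximal actions is an assumption of this
    # sketch; the text does not specify how ties are resolved.
    best = np.flatnonzero(q_values == np.max(q_values))
    return int(rng.choice(best))


rng = np.random.default_rng(0)
q = np.array([0.0, 1.0, 0.5])
train_action = epsilon_greedy(q, EPSILON_TRAIN, rng)  # occasionally exploratory
eval_action = epsilon_greedy(q, EPSILON_EVAL, rng)    # always the greedy action
```

With ε = 0.1 the behavior policy sometimes takes a non-greedy action, so a target computed from the action actually taken varies from visit to visit even when the environment's transitions and rewards are deterministic.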
