The LoCA Regret: A Consistent Metric to Evaluate Model-Based Behavior in Reinforcement Learning -- Supplementary Material -- A Tabular Experiments
–Neural Information Processing Systems
For all tabular experiments, we used ε-greedy exploration with ε = 0.1. Furthermore, during pretraining and training, we used a maximum episode length of 100. For evaluation, we set ε = 0 and ran 10 evaluation episodes. We used a fixed step-size α for all tabular experiments. Note that the update target is stochastic even in deterministic environments, because the behavior policy explores.
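The exploration and evaluation settings described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `env_reset` / `env_step` interface, the list-of-lists `q_table` layout, and the random tie-breaking among greedy actions are all assumptions made for the example. Only the numeric settings (0.1, 0, 100, 10) come from the text.

```python
import random

# Settings stated in the text.
EPSILON_TRAIN = 0.1        # exploration rate during pretraining and training
EPSILON_EVAL = 0.0         # greedy behavior during evaluation
MAX_EPISODE_LENGTH = 100   # episode-length cap
NUM_EVAL_EPISODES = 10     # evaluation episodes per evaluation


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    # Assumption: ties among greedy actions are broken uniformly at random.
    return rng.choice([a for a, q in enumerate(q_values) if q == best])


def run_episode(env_reset, env_step, q_table, epsilon, rng,
                max_steps=MAX_EPISODE_LENGTH):
    """Run one episode and return the undiscounted sum of rewards.

    Hypothetical environment interface (not from the paper):
      env_reset() -> state
      env_step(state, action) -> (next_state, reward, done)
    """
    state = env_reset()
    total = 0.0
    for _ in range(max_steps):
        action = epsilon_greedy(q_table[state], epsilon, rng)
        state, reward, done = env_step(state, action)
        total += reward
        if done:
            break
    return total


def evaluate(env_reset, env_step, q_table, rng):
    """Average return over NUM_EVAL_EPISODES greedy (epsilon = 0) episodes."""
    returns = [
        run_episode(env_reset, env_step, q_table, EPSILON_EVAL, rng)
        for _ in range(NUM_EVAL_EPISODES)
    ]
    return sum(returns) / len(returns)
```

During training, the same `run_episode` loop would be called with `EPSILON_TRAIN`, so roughly one action in ten is random. That randomness is what makes the update target stochastic even when the environment itself is deterministic.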
Feb-8-2026, 07:44:44 GMT