The LoCA Regret: A Consistent Metric to Evaluate Model-Based Behavior in Reinforcement Learning -- Supplementary Material -- AT abular Experiments

Neural Information Processing Systems 

Here, we discuss some additional settings for the tabular experiments. The reason for this is that Sarsa(0.95), in contrast to MB-VI and MB-SU, is a multi-step Therefore, there is stochasticity in the update target even in deterministic environments due to exploration of the behavior policy. All methods used optimistic initialization. The pseudocode of the tabular, on-policy method used in Section 5.1 is shown in Algorithm 1. These estimates are updated at the end of the episode, using the data gathered during the episode.

Similar Docs  Excel Report  more

TitleSimilaritySource
None found