The LoCA Regret: A Consistent Metric to Evaluate Model-Based Behavior in Reinforcement Learning -- Supplementary Material -- AT abular Experiments
–Neural Information Processing Systems
Here, we discuss some additional settings for the tabular experiments. The reason for this is that Sarsa(0.95), in contrast to MB-VI and MB-SU, is a multi-step Therefore, there is stochasticity in the update target even in deterministic environments due to exploration of the behavior policy. All methods used optimistic initialization. The pseudocode of the tabular, on-policy method used in Section 5.1 is shown in Algorithm 1. These estimates are updated at the end of the episode, using the data gathered during the episode.
Neural Information Processing Systems
Oct-2-2025, 20:18:27 GMT