mbpo
Reviewer #1 Q1: the claim that the algorithm really manages to align the latent distributions of real and simulated data
Q1: ...the claim that the algorithm really manages to align the latent distributions of real and simulated data... We will revise the inappropriate statements in the final version. Q2: In the model adaptation phase, are state-action pairs simply sampled randomly from their respective buffers? Do you have results for a single, monolithic model? Q4: Did you investigate the reasons for the slow learning in the 500 steps on InvertedPendulum compared to PETS? Q1: The experiments shown in Figure 2 do not outperform MBPO beyond the confidence bounds.
Appendix A Reminders about integral probability metrics Let
In the context of Section 4.1, we have (at least) the following instantiations of Assumption 4.2: (i) Assume the reward is bounded by r ... We provide a proof for Lemma 4.1 for completeness. Now we prove Theorem 4.2. We first note that a two-sided bound follows from Lemma 4.1: | η ... We outline the practical MOPO algorithm in Algorithm 2. To answer question (3), we conduct a thorough ablation study on MOPO. The main goal of the ablation study is to understand how the choice of reward penalty affects performance. Require: reward penalty coefficient λ, rollout horizon h, rollout batch size b.
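The reward penalty mentioned in the excerpt above can be illustrated with a minimal sketch. It assumes the standard MOPO form, in which the model reward is reduced by the penalty coefficient times an uncertainty estimate. The function name `penalized_reward` and the sample values are hypothetical, and the uncertainty is taken as a given array rather than computed from a learned dynamics model.

```python
import numpy as np

def penalized_reward(reward, uncertainty, lam):
    """Return r_tilde = r - lam * u, elementwise over a batch of model transitions.

    `uncertainty` is assumed to be a per-transition uncertainty estimate u(s, a);
    how it is obtained is outside the scope of this sketch.
    """
    reward = np.asarray(reward, dtype=float)
    uncertainty = np.asarray(uncertainty, dtype=float)
    return reward - lam * uncertainty

# Illustrative values only, not results from any experiment.
r = np.array([1.0, 0.5, 2.0])
u = np.array([0.0, 1.0, 0.5])
r_tilde = penalized_reward(r, u, lam=2.0)
```

A larger coefficient makes the policy more conservative in regions where the model is uncertain, which is the trade-off an ablation over the penalty would probe.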
Appendix for Model-based Policy Optimization with Unsupervised Model Adaptation A Omitted Proofs
Besides Wasserstein distance, we can use other distribution divergence metrics to align the features. MMD is another instance of IPM when the witness function class is the unit ball in a reproducing kernel Hilbert space (RKHS). The results on three environments are shown in Figure 5. We show the one-step model losses during the experiments in the other four environments in Figure D.5. We find that the conclusion in Section 5.2 still holds in these four environments.
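As a rough illustration of MMD as a feature-alignment metric, the sketch below computes the biased empirical estimate of squared MMD between two sample sets. The Gaussian kernel, the `bandwidth` parameter, and the function names are assumptions made for this example. The excerpt does not say which kernel or estimator the authors use.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Pairwise Gaussian kernel matrix between rows of x (n, d) and y (m, d)."""
    sq_dists = (
        np.sum(x ** 2, axis=1)[:, None]
        + np.sum(y ** 2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    k_xx = gaussian_kernel(x, x, bandwidth).mean()
    k_yy = gaussian_kernel(y, y, bandwidth).mean()
    k_xy = gaussian_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy

# Synthetic stand-ins for real and simulated feature batches.
rng = np.random.default_rng(0)
real_features = rng.normal(size=(200, 4))
sim_features = real_features + 3.0  # deliberately shifted distribution

mmd_same = mmd_squared(real_features, real_features)
mmd_shifted = mmd_squared(real_features, sim_features)
```

The estimate is near zero when both sample sets are identical and grows as the distributions move apart, which is why it can serve as an alignment loss in place of a Wasserstein critic.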
Stealing That Free Lunch: Exposing the Limits of Dyna-Style Reinforcement Learning
Barkley, Brett, Fridovich-Keil, David
Dyna-style off-policy model-based reinforcement learning (DMBRL) algorithms are a family of techniques for generating synthetic state transition data and thereby enhancing the sample efficiency of off-policy RL algorithms. This paper identifies and investigates a surprising performance gap observed when applying DMBRL algorithms across different benchmark environments with proprioceptive observations. We show that, while DMBRL algorithms perform well in OpenAI Gym, their performance can drop significantly in DeepMind Control Suite (DMC), even though these settings offer similar tasks and identical physics backends. Modern techniques designed to address several key issues that arise in these settings do not provide a consistent improvement across all environments, and overall our results show that adding synthetic rollouts to the training process -- the backbone of Dyna-style algorithms -- significantly degrades performance across most DMC environments. Our findings contribute to a deeper understanding of several fundamental challenges in model-based RL and show that, like many optimization fields, there is no free lunch when evaluating performance across diverse benchmarks in RL.