AITopics | mbpo

Appendix A: Reminders about integral probability metrics. Let ...

Neural Information Processing Systems | Oct-3-2025, 18:17:40 GMT

In the context of Section 4.1, we have (at least) the following instantiations of Assumption 4.2: (i) Assume the reward is bounded by r ... We provide a proof for Lemma 4.1 for completeness. Now we prove Theorem 4.2. We first note that a two-sided bound follows from Lemma 4.1: | η ... We outline the practical MOPO algorithm in Algorithm 2. To answer question (3), we conduct a thorough ablation study on MOPO. The main goal of the ablation study is to understand how the choice of reward penalty affects performance. ... Require: reward penalty coefficient λ, rollout horizon h, rollout batch size b.
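As a rough illustration of the reward penalty mentioned in the snippet above, the following sketch subtracts a scaled uncertainty estimate from a model-predicted reward. It assumes the uncertainty value is already available from some estimator; the function and variable names are hypothetical and are not taken from the paper's Algorithm 2.

```python
def penalized_reward(reward, uncertainty, penalty_coef):
    """Return the reward minus penalty_coef times an uncertainty estimate.

    Assumption: `uncertainty` is a non-negative scalar supplied by some
    model-uncertainty estimator (not specified here), and `penalty_coef`
    plays the role of the reward penalty coefficient lambda.
    """
    return reward - penalty_coef * uncertainty


def penalize_batch(rewards, uncertainties, penalty_coef):
    """Apply the penalty to every transition in a rollout batch."""
    return [
        penalized_reward(r, u, penalty_coef)
        for r, u in zip(rewards, uncertainties)
    ]


# Hypothetical batch of three model-generated transitions.
batch_rewards = [1.0, 2.0, 0.5]
batch_uncertainties = [0.5, 0.0, 0.25]
penalized = penalize_batch(batch_rewards, batch_uncertainties, penalty_coef=2.0)
```

A larger coefficient makes the policy more conservative in regions where the model is uncertain; a coefficient of zero leaves the model rewards unchanged.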

dataset, mopo, reward penalty, (14 more...)

Neural Information Processing Systems

Country: North America > United States (0.04)

Industry:

Health & Medicine > Therapeutic Area > Immunology (0.77)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.55)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.69)


a322852ce0df73e204b7e67cbbef0d0a-AuthorFeedback.pdf

Neural Information Processing Systems | Oct-3-2025, 18:17:20 GMT

mopo, value function, variance, (16 more...)

Neural Information Processing Systems

Country: North America > United States (0.05)

Industry:

Health & Medicine > Therapeutic Area > Immunology (0.65)
Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.30)


Appendix for Model-based Policy Optimization with Unsupervised Model Adaptation. A Omitted Proofs

Neural Information Processing Systems | Oct-2-2025, 09:11:26 GMT

Besides Wasserstein distance, we can use other distribution divergence metrics to align the features. MMD is another instance of IPM when the witness function class is the unit ball in a reproducing kernel Hilbert space (RKHS). The results on three environments are shown in Figure 5. We show the one-step model losses during the experiments in the other four environments in Figure D.5. We find that the conclusion in Section 5.2 still holds in these four environments.
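To make the MMD remark above concrete, here is a minimal sketch of a biased sample estimate of squared MMD between two sets of feature vectors. It assumes a Gaussian (RBF) kernel with a fixed bandwidth; the kernel choice, the bandwidth, and all names are illustrative assumptions and do not describe the configuration used in the paper.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples xs and ys."""

    def mean_kernel(a_set, b_set):
        total = sum(
            gaussian_kernel(a, b, bandwidth) for a in a_set for b in b_set
        )
        return total / (len(a_set) * len(b_set))

    return (
        mean_kernel(xs, xs)
        + mean_kernel(ys, ys)
        - 2.0 * mean_kernel(xs, ys)
    )


# Hypothetical 2-D features standing in for real and simulated data.
real_features = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0)]
sim_features = [(1.0, 1.1), (0.9, 1.2), (1.1, 0.8)]

mmd_same = mmd_squared(real_features, real_features)
mmd_diff = mmd_squared(real_features, sim_features)
```

The estimate is zero when both samples are identical and grows as the two feature sets move apart, which is the property a feature-alignment loss relies on.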

ampo, artificial intelligence, machine learning, (13 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)


Reviewer #1, Q1: the claim that the algorithm really manages to align the latent distributions of real and simulated data

Neural Information Processing Systems | Oct-2-2025, 09:11:07 GMT

Q1: ...the claim that the algorithm really manages to align the latent distributions of real and simulated data... We will revise the inappropriate statements in the final version. Q2: In the model adaptation phase, are state-action pairs simply sampled randomly from their respective buffers? Do you have results for a single, monolithic model? Q4: Did you investigate the reasons for the slow learning in the 500 steps on InvertedPendulum compared to PETS? Q1: The experiments shown in Figure 2 do not outperform MBPO beyond the confidence bounds.

artificial intelligence, model adaptation, reviewer, (15 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.34)


Stealing That Free Lunch: Exposing the Limits of Dyna-Style Reinforcement Learning

Barkley, Brett, Fridovich-Keil, David

arXiv.org Artificial Intelligence | Dec-20-2024

Dyna-style off-policy model-based reinforcement learning (DMBRL) algorithms are a family of techniques for generating synthetic state transition data and thereby enhancing the sample efficiency of off-policy RL algorithms. This paper identifies and investigates a surprising performance gap observed when applying DMBRL algorithms across different benchmark environments with proprioceptive observations. We show that, while DMBRL algorithms perform well in OpenAI Gym, their performance can drop significantly in DeepMind Control Suite (DMC), even though these settings offer similar tasks and identical physics backends. Modern techniques designed to address several key issues that arise in these settings do not provide a consistent improvement across all environments, and overall our results show that adding synthetic rollouts to the training process -- the backbone of Dyna-style algorithms -- significantly degrades performance across most DMC environments. Our findings contribute to a deeper understanding of several fundamental challenges in model-based RL and show that, like many optimization fields, there is no free lunch when evaluating performance across diverse benchmarks in RL.

machine learning, mbpo, reinforcement learning, (17 more...)

arXiv.org Artificial Intelligence

2412.14312

Country: