
Neural Information Processing Systems

We found that training an inverse model is crucial for learning good representations. On the first row, a level from each environment that one-shot PPGS fails to solve (the white arrows represent the policy). Iterative Model Improvement: In general settings, collecting training trajectories by sampling actions uniformly at random does not grant sufficient coverage of the state space. GLAMOR [34] learns inverse dynamics to achieve visual goals in Atari games. The only difference with PPGS in terms of settings is that we allow GLAMOR to collect data on-policy and for more interactions (2M).
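The inverse-dynamics idea referenced above (predicting which action connected two consecutive observations) can be illustrated in a minimal tabular form. The 1-D gridworld and all names below are illustrative assumptions, not details from the paper:

```python
# Tabular inverse dynamics on a toy 1-D gridworld: given a consecutive
# state pair (s, s_next), record which action most often produced it.
from collections import Counter, defaultdict
import random

ACTIONS = (-1, +1)  # left, right

def step(s, a, size=5):
    # Deterministic dynamics, clipped to the grid boundaries.
    return max(0, min(size - 1, s + a))

def collect(n_steps=2000, size=5, seed=0):
    # Trajectories from a uniformly random policy.
    rng = random.Random(seed)
    s, data = 0, []
    for _ in range(n_steps):
        a = rng.choice(ACTIONS)
        s_next = step(s, a, size)
        data.append((s, s_next, a))
        s = s_next
    return data

def fit_inverse_model(data):
    # inverse[(s, s_next)] -> most frequent action for that transition.
    counts = defaultdict(Counter)
    for s, s_next, a in data:
        counts[(s, s_next)][a] += 1
    return {pair: c.most_common(1)[0][0] for pair, c in counts.items()}

inv = fit_inverse_model(collect())
print(inv[(1, 2)])  # the action that moved state 1 to state 2
```

In the deep-learning setting the lookup table is replaced by a network that takes two consecutive latent states and predicts the action; the point the excerpt makes is that this auxiliary prediction task shapes the learned representation.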







PrivilegedDreamer: Explicit Imagination of Privileged Information for Rapid Adaptation of Learned Policies

Byrd, Morgan, Crandell, Jackson, Das, Mili, Inman, Jessica, Wright, Robert, Ha, Sehoon

arXiv.org Artificial Intelligence

Numerous real-world control problems involve dynamics and objectives affected by unobservable hidden parameters, ranging from autonomous driving to robotic manipulation, which cause performance degradation during sim-to-real transfer. To represent these kinds of domains, we adopt hidden-parameter Markov decision processes (HIP-MDPs), which model sequential decision problems where hidden variables parameterize transition and reward functions. Existing approaches, such as domain randomization, domain adaptation, and meta-learning, simply treat the effect of hidden parameters as additional variance and often struggle to effectively handle HIP-MDP problems, especially when the rewards are parameterized by hidden variables. We introduce PrivilegedDreamer, a model-based reinforcement learning framework that extends the existing model-based approach by incorporating an explicit parameter estimation module. PrivilegedDreamer features a novel dual recurrent architecture that explicitly estimates hidden parameters from limited historical data and enables us to condition the model, actor, and critic networks on these estimated parameters. Our empirical analysis on five diverse HIP-MDP tasks demonstrates that PrivilegedDreamer outperforms state-of-the-art model-based, model-free, and domain adaptation learning algorithms. Additionally, we conduct ablation studies to justify the inclusion of each component in the proposed architecture.
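The core loop the abstract describes (estimate hidden parameters from recent history, then condition the policy on the estimate) can be sketched on a toy system. This is a minimal simplification, not the paper's dual recurrent architecture: the linear system, least-squares estimator, and function names are all assumptions for illustration.

```python
# Sketch: online estimation of a hidden dynamics parameter, with a
# parameter-conditioned policy. Toy system: x_{t+1} = x_t + theta * u_t,
# where the gain theta is hidden from the agent.

def estimate_theta(history, default):
    # Least-squares fit of theta from (x_next - x) = theta * u.
    num = sum((xn - x) * u for x, u, xn in history if u != 0.0)
    den = sum(u * u for x, u, xn in history if u != 0.0)
    return num / den if den > 0 else default

def policy(x, theta_est, target=1.0):
    # Conditioning on the estimate: invert the estimated gain to
    # reach the target in one step (if the estimate is right).
    return (target - x) / theta_est

def rollout(theta, n=20):
    x, history, est = 0.0, [], 1.0  # est starts at a wrong prior
    for _ in range(n):
        u = policy(x, est)
        x_next = x + theta * u
        history.append((x, u, x_next))
        est = estimate_theta(history, default=est)
        x = x_next
    return x, est

x_final, theta_est = rollout(theta=0.5)
```

With the true gain 0.5 but a prior of 1.0, the first transition already identifies theta, after which the conditioned controller reaches the target exactly; the paper's contribution is doing this estimation with recurrent networks from pixels and rewards rather than from known linear dynamics.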


HarmonyDream: Task Harmonization Inside World Models

Ma, Haoyu, Wu, Jialong, Feng, Ningya, Xiao, Chenjun, Li, Dong, Hao, Jianye, Wang, Jianmin, Long, Mingsheng

arXiv.org Artificial Intelligence

Model-based reinforcement learning (MBRL) holds the promise of sample-efficient learning by utilizing a world model, which models how the environment works and typically encompasses components for two tasks: observation modeling and reward modeling. In this paper, through a dedicated empirical investigation, we gain a deeper understanding of the role each task plays in world models and uncover the overlooked potential of sample-efficient MBRL by mitigating the domination of either observation or reward modeling. Our key insight is that while prevalent approaches of explicit MBRL attempt to restore abundant details of the environment via observation models, it is difficult due to the environment's complexity and limited model capacity. On the other hand, reward models, while dominating implicit MBRL and adept at learning compact task-centric dynamics, are inadequate for sample-efficient learning without richer learning signals. Motivated by these insights and discoveries, we propose a simple yet effective approach, HarmonyDream, which automatically adjusts loss coefficients to maintain task harmonization, i.e. a dynamic equilibrium between the two tasks in world model learning. Our experiments show that the base MBRL method equipped with HarmonyDream gains 10%-69% absolute performance boosts on visual robotic tasks and sets a new state-of-the-art result on the Atari 100K benchmark.
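The mechanism the abstract names, automatically adjusting loss coefficients so neither observation nor reward modeling dominates, can be sketched with a simple running-magnitude scheme. This EMA-based rescaling is an illustrative simplification of the idea, not HarmonyDream's exact harmonizer, and all names here are assumptions:

```python
# Sketch: dynamic two-task loss balancing. Each loss is divided by its
# running magnitude, so both tasks contribute at roughly unit scale to
# the total objective regardless of their raw magnitudes.

class LossHarmonizer:
    def __init__(self, decay=0.99, eps=1e-8):
        self.decay, self.eps = decay, eps
        self.ema = {}  # running magnitude per loss name

    def __call__(self, losses):
        total = 0.0
        for name, value in losses.items():
            prev = self.ema.get(name, abs(value))
            self.ema[name] = self.decay * prev + (1 - self.decay) * abs(value)
            total += value / (self.ema[name] + self.eps)  # ~unit scale
        return total

h = LossHarmonizer()
# A pixel-reconstruction loss can be orders of magnitude larger than a
# scalar reward-prediction loss; after rescaling, each contributes ~1.
balanced = h({"observation": 250.0, "reward": 0.02})
```

The design point is that fixed hand-tuned coefficients must be re-found per environment, whereas a dynamically normalized objective maintains the equilibrium between the two tasks as their loss scales drift during training.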