Zeng, Wenjun
Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning
Wang, Qi, Zhang, Zhipeng, Xie, Baao, Jin, Xin, Wang, Yunbo, Wang, Shiyu, Zheng, Liaomo, Yang, Xiaokang, Zeng, Wenjun
Training visual reinforcement learning (RL) in practical scenarios presents a significant challenge, $\textit{i.e.,}$ RL agents suffer from low sample efficiency in environments with variations. While various approaches have attempted to alleviate this issue through disentangled representation learning, these methods usually start learning from scratch without prior knowledge of the world. This paper, in contrast, tries to learn and understand underlying semantic variations from distracting videos via offline-to-online latent distillation and flexible disentanglement constraints. To enable effective cross-domain semantic knowledge transfer, we introduce an interpretable model-based RL framework, dubbed Disentangled World Models (DisWM). Specifically, we pretrain an action-free video prediction model offline with disentanglement regularization to extract semantic knowledge from distracting videos. The disentanglement capability of the pretrained model is then transferred to the world model through latent distillation. For finetuning in the online environment, we exploit the knowledge of the pretrained model and introduce a disentanglement constraint to the world model. During the adaptation phase, the incorporation of actions and rewards from online environment interactions enriches the diversity of the data, which in turn strengthens disentangled representation learning. Experimental results validate the superiority of our approach on various benchmarks.
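As a rough illustration of the offline-to-online latent distillation idea described above, the sketch below distills the latents of a frozen, pretrained teacher encoder into a world-model encoder while keeping a beta-VAE-style factorization penalty as the disentanglement constraint. The module names, architecture, and loss weighting are assumptions made for illustration, not the paper's implementation.

```python
# Hypothetical sketch of offline-to-online latent distillation (names and loss
# choices are assumptions, not the paper's exact formulation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps 64x64 RGB frames to a factored latent (mean, logvar)."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Linear(64 * 16 * 16, 2 * latent_dim)

    def forward(self, x):
        mu, logvar = self.head(self.conv(x)).chunk(2, dim=-1)
        return mu, logvar

def distillation_loss(teacher, student, frames, beta=4.0):
    """Distill the frozen teacher's latents into the world-model encoder and
    keep a beta-VAE-style factorization (disentanglement) constraint."""
    with torch.no_grad():
        t_mu, _ = teacher(frames)                      # teacher latents (frozen)
    s_mu, s_logvar = student(frames)
    distill = F.mse_loss(s_mu, t_mu)                   # latent distillation term
    kl = -0.5 * torch.mean(1 + s_logvar - s_mu.pow(2) - s_logvar.exp())
    return distill + beta * kl                         # disentanglement constraint

teacher, student = Encoder(), Encoder()
frames = torch.rand(8, 3, 64, 64)
loss = distillation_loss(teacher, student, frames)
loss.backward()
```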
ULTHO: Ultra-Lightweight yet Efficient Hyperparameter Optimization in Deep Reinforcement Learning
Yuan, Mingqi, Li, Bo, Jin, Xin, Zeng, Wenjun
Hyperparameter optimization (HPO) is a billion-dollar problem in machine learning, as it significantly impacts training efficiency and model performance. However, achieving efficient and robust HPO in deep reinforcement learning (RL) is consistently challenging due to its high non-stationarity and computational cost. To tackle this problem, existing approaches attempt to adapt common HPO techniques (e.g., population-based training or Bayesian optimization) to the RL scenario. However, they remain sample-inefficient and computationally expensive, which hinders their use in a wide range of applications. In this paper, we propose ULTHO, an ultra-lightweight yet powerful framework for fast HPO in deep RL within single runs. Specifically, we formulate the HPO process as a multi-armed bandit with clustered arms (MABC) and link it directly to long-term return optimization. ULTHO also provides a quantified and statistical perspective to filter HPs efficiently. We test ULTHO on benchmarks including ALE, Procgen, MiniGrid, and PyBullet. Extensive experiments demonstrate that ULTHO can achieve superior performance with a simple architecture, contributing to the development of advanced and automated RL systems.
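The MABC formulation can be pictured as a bandit that scores candidate hyperparameter values inside each cluster and is rewarded by the agent's long-term return. The snippet below is a minimal sketch under these assumptions (UCB scoring within clusters, incremental mean updates, episodic return as feedback); it is not the exact ULTHO algorithm.

```python
# Minimal sketch of a multi-armed bandit with clustered arms (MABC) for
# hyperparameter selection; the UCB rule and return-based feedback are
# illustrative assumptions, not ULTHO's precise procedure.
import math, random

class ClusteredUCB:
    def __init__(self, clusters, c=2.0):
        # clusters: dict mapping cluster name -> list of candidate HP values
        self.clusters = clusters
        self.c = c
        self.counts = {k: [0] * len(v) for k, v in clusters.items()}
        self.values = {k: [0.0] * len(v) for k, v in clusters.items()}
        self.t = 0

    def select(self):
        """Pick one HP value per cluster using a UCB score inside each cluster."""
        self.t += 1
        choice = {}
        for k, arms in self.clusters.items():
            scores = []
            for i in range(len(arms)):
                if self.counts[k][i] == 0:
                    scores.append(float("inf"))
                else:
                    bonus = self.c * math.sqrt(math.log(self.t) / self.counts[k][i])
                    scores.append(self.values[k][i] + bonus)
            choice[k] = max(range(len(arms)), key=scores.__getitem__)
        return choice

    def update(self, choice, episodic_return):
        """Use the long-term return as bandit feedback for the chosen arms."""
        for k, i in choice.items():
            self.counts[k][i] += 1
            n = self.counts[k][i]
            self.values[k][i] += (episodic_return - self.values[k][i]) / n

bandit = ClusteredUCB({"lr": [1e-4, 3e-4, 1e-3], "clip": [0.1, 0.2, 0.3]})
for _ in range(20):
    picked = bandit.select()
    ret = random.random()           # placeholder for the agent's episodic return
    bandit.update(picked, ret)
```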
Adaptive Data Exploitation in Deep Reinforcement Learning
Yuan, Mingqi, Li, Bo, Jin, Xin, Zeng, Wenjun
We introduce ADEPT: Adaptive Data ExPloiTation, a simple yet powerful framework to enhance **data efficiency** and **generalization** in deep reinforcement learning (RL). Specifically, ADEPT adaptively manages the use of sampled data across different learning stages via multi-armed bandit (MAB) algorithms, optimizing data utilization while mitigating overfitting. Moreover, ADEPT can significantly reduce computational overhead and accelerate a wide range of RL algorithms. We test ADEPT on benchmarks including Procgen, MiniGrid, and PyBullet. Extensive simulations demonstrate that ADEPT can achieve superior performance with remarkable computational efficiency, offering a practical solution to data-efficient RL. Our code is available at https://github.com/yuanmingqi/ADEPT.
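To make the bandit-driven data-exploitation idea concrete, the toy sketch below uses an epsilon-greedy bandit to choose how many update epochs to run on each batch of freshly sampled data, with performance improvement as feedback. The arm set, feedback signal, and exploration schedule are assumptions for illustration, not ADEPT's exact design.

```python
# Illustrative sketch of adaptively choosing how aggressively to exploit
# freshly sampled data (e.g., update epochs per rollout) with an epsilon-greedy
# bandit; not ADEPT's actual implementation.
import random

ARMS = [1, 2, 4, 8]                      # candidate update epochs per rollout
counts, values = [0] * len(ARMS), [0.0] * len(ARMS)

def select(eps=0.1):
    if random.random() < eps or 0 in counts:
        return random.randrange(len(ARMS))
    return max(range(len(ARMS)), key=values.__getitem__)

def update(arm, improvement):
    # improvement: change in evaluation return after this round of updates
    counts[arm] += 1
    values[arm] += (improvement - values[arm]) / counts[arm]

for step in range(50):
    arm = select()
    epochs = ARMS[arm]                   # exploit the sampled data `epochs` times
    improvement = random.gauss(0.0, 1.0) # placeholder performance feedback
    update(arm, improvement)
```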
Deep Reinforcement Learning with Hybrid Intrinsic Reward Model
Yuan, Mingqi, Li, Bo, Jin, Xin, Zeng, Wenjun
Intrinsic reward shaping has emerged as a prevalent approach to solving hard-exploration and sparse-reward environments in reinforcement learning (RL). While single intrinsic rewards, such as curiosity-driven or novelty-based methods, have shown effectiveness, they often limit the diversity and efficiency of exploration. Moreover, the potential and principles of combining multiple intrinsic rewards remain insufficiently explored. To address this gap, we introduce HIRE (Hybrid Intrinsic REward), a flexible and elegant framework for creating hybrid intrinsic rewards through deliberate fusion strategies. With HIRE, we conduct a systematic analysis of the application of hybrid intrinsic rewards in both general and unsupervised RL across multiple benchmarks. Extensive experiments demonstrate that HIRE can significantly enhance exploration efficiency and diversity, as well as skill acquisition in complex and dynamic settings.
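The core idea of fusing several intrinsic reward sources can be sketched as a single function that combines per-source signals under different strategies. The fusion strategies shown below (weighted sum, product, maximum) are generic examples, not necessarily the exact set studied in HIRE.

```python
# Minimal sketch of fusing multiple intrinsic reward signals; the strategies
# are illustrative, not the paper's definitive fusion scheme.
import numpy as np

def fuse_intrinsic(rewards, strategy="sum", weights=None):
    """rewards: (num_sources, batch) array of per-source intrinsic rewards."""
    rewards = np.asarray(rewards, dtype=np.float64)
    if weights is None:
        weights = np.ones(rewards.shape[0]) / rewards.shape[0]
    if strategy == "sum":        # weighted mixture of sources
        return np.tensordot(weights, rewards, axes=1)
    if strategy == "product":    # sources must agree for the reward to stay large
        return np.prod(rewards, axis=0)
    if strategy == "max":        # the most "surprised" source dominates
        return np.max(rewards, axis=0)
    raise ValueError(f"unknown fusion strategy: {strategy}")

curiosity = np.random.rand(4)    # e.g., prediction-error-based signal
novelty   = np.random.rand(4)    # e.g., episodic state novelty
r_int = fuse_intrinsic([curiosity, novelty], strategy="sum")
```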
Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty
Hahn, Meera, Zeng, Wenjun, Kannen, Nithish, Galt, Rich, Badola, Kartikeya, Kim, Been, Wang, Zi
User prompts for generative AI models are often underspecified, leading to sub-optimal responses. This problem is particularly evident in text-to-image (T2I) generation, where users commonly struggle to articulate their precise intent. This disconnect between the user's vision and the model's interpretation often forces users to painstakingly and repeatedly refine their prompts. To address this, we propose a design for proactive T2I agents equipped with an interface to (1) actively ask clarification questions when uncertain, and (2) present their understanding of user intent as an understandable belief graph that a user can edit. We build simple prototypes for such agents and verify their effectiveness through both human studies and automated evaluation. We observed that at least 90% of human subjects found these agents and their belief graphs helpful for their T2I workflow. Moreover, we develop a scalable automated evaluation approach using two agents: one holds a ground-truth image, and the other tries to ask as few questions as possible to align with it. On DesignBench, a benchmark we created for artists and designers, the COCO dataset (Lin et al., 2014), and ImageInWords (Garg et al., 2024), we observed that these T2I agents were able to ask informative questions and elicit crucial information, achieving successful alignment with at least 2 times higher VQAScore (Lin et al., 2024) than standard single-turn T2I generation. Demo: https://github.com/google-deepmind/proactive_t2i_agents.
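The two-agent evaluation loop can be pictured as follows: a questioner agent queries an oracle that holds the ground-truth description until it is confident, then generates and is scored. The sketch below uses stub functions and a toy stopping rule purely to show the control flow; it is not the paper's implementation, and the scoring function stands in for VQAScore.

```python
# Schematic two-agent evaluation loop: an "oracle" holds a ground-truth
# description, a T2I agent asks clarification questions, then is scored.
# All function bodies are stand-ins, not the actual system.
def ask_question(belief):
    return None if belief["confident"] else "What style should the image have?"

def answer_from_ground_truth(question, ground_truth):
    return f"(answer derived from: {ground_truth})"

def update_belief(belief, question, answer):
    belief["facts"].append((question, answer))
    belief["confident"] = len(belief["facts"]) >= 2   # toy stopping rule
    return belief

def generate_and_score(belief, ground_truth):
    return 0.5 + 0.25 * len(belief["facts"])          # placeholder for VQAScore

ground_truth = "a watercolor painting of a red lighthouse at dusk"
belief = {"facts": [], "confident": False}
while (q := ask_question(belief)) is not None:
    a = answer_from_ground_truth(q, ground_truth)
    belief = update_belief(belief, q, a)
score = generate_and_score(belief, ground_truth)
print(f"alignment score after {len(belief['facts'])} questions: {score}")
```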
Open-World Reinforcement Learning over Long Short-Term Imagination
Li, Jiajian, Wang, Qi, Wang, Yunbo, Jin, Xin, Li, Yang, Zeng, Wenjun, Yang, Xiaokang
Training visual reinforcement learning agents in a high-dimensional open world presents significant challenges. While various model-based methods have improved sample efficiency by learning interactive world models, these agents tend to be "short-sighted", as they are typically trained on short snippets of imagined experiences. We argue that the primary obstacle in open-world decision-making is improving the efficiency of off-policy exploration across an extensive state space. In this paper, we present LS-Imagine, which extends the imagination horizon within a limited number of state transition steps, enabling the agent to explore behaviors that potentially lead to promising long-term feedback. The foundation of our approach is to build a long short-term world model. To achieve this, we simulate goal-conditioned jumpy state transitions and compute corresponding affordance maps by zooming in on specific areas within single images. This facilitates the integration of direct long-term values into behavior learning. Our method demonstrates significant improvements over state-of-the-art techniques in MineDojo.
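One way to picture mixing short-term steps with "jumpy" long-term transitions is to let every imagined transition carry the number of environment steps it spans, so long-horizon feedback is discounted by the elapsed interval rather than the rollout index. The toy sketch below illustrates that interval-aware return computation; the variable names and return rule are assumptions, not LS-Imagine's exact objective.

```python
# Toy sketch of interval-aware returns over a mixed rollout of one-step and
# "jumpy" imagined transitions.
def mixed_horizon_return(rewards, skips, gamma=0.99):
    """rewards[i]: imagined reward of transition i; skips[i]: env steps it spans."""
    ret, elapsed = 0.0, 0
    for r, dt in zip(rewards, skips):
        ret += (gamma ** elapsed) * r
        elapsed += dt
    return ret

# Three short-term steps followed by one jumpy transition spanning 50 steps.
rewards = [0.1, 0.1, 0.1, 5.0]
skips = [1, 1, 1, 50]
print(mixed_horizon_return(rewards, skips))
```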
HIMO: A New Benchmark for Full-Body Human Interacting with Multiple Objects
Lv, Xintao, Xu, Liang, Yan, Yichao, Jin, Xin, Xu, Congsheng, Wu, Shuwen, Liu, Yifan, Li, Lincheng, Bi, Mengxiao, Zeng, Wenjun, Yang, Xiaokang
Generating human-object interactions (HOIs) is critical with the tremendous advances of digital avatars. Existing datasets are typically limited to humans interacting with a single object, neglecting the ubiquitous manipulation of multiple objects. Thus, we propose HIMO, a large-scale MoCap dataset of full-body humans interacting with multiple objects, containing 3.3K 4D HOI sequences and 4.08M 3D HOI frames. We also annotate HIMO with detailed textual descriptions and temporal segments, benchmarking two novel tasks of HOI synthesis conditioned on either the whole text prompt or the segmented text prompts for fine-grained timeline control. To address these novel tasks, we propose a dual-branch conditional diffusion model with a mutual interaction module for HOI synthesis. Besides, an auto-regressive generation pipeline is also designed to obtain smooth transitions between HOI segments. Experimental results demonstrate our method's generalization ability to unseen object geometries and temporal compositions.
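A "mutual interaction" block between two branches can be sketched as bidirectional cross-attention: the human-motion branch attends to the object branch and vice versa, with residual connections. The layer sizes and wiring below are assumptions made for illustration, not the paper's architecture.

```python
# Hypothetical sketch of a mutual interaction block built from cross-attention.
import torch
import torch.nn as nn

class MutualInteraction(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.h2o = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.o2h = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, human, obj):
        # Each branch queries the other and keeps a residual connection.
        h_out, _ = self.o2h(human, obj, obj)    # human attends to objects
        o_out, _ = self.h2o(obj, human, human)  # objects attend to human
        return human + h_out, obj + o_out

block = MutualInteraction()
human_tokens = torch.rand(2, 120, 256)   # (batch, frames, features)
object_tokens = torch.rand(2, 120, 256)
human_tokens, object_tokens = block(human_tokens, object_tokens)
```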
Real-Time Dynamic Robot-Assisted Hand-Object Interaction via Motion Primitives
Yuan, Mingqi, Wang, Huijiang, Chu, Kai-Fung, Iida, Fumiya, Li, Bo, Zeng, Wenjun
Advances in artificial intelligence (AI) have been propelling the evolution of human-robot interaction (HRI) technologies. However, significant challenges remain in achieving seamless interactions, particularly in tasks requiring physical contact with humans. These challenges arise from the need for accurate real-time perception of human actions, adaptive control algorithms for robots, and the effective coordination between human and robotic movements. In this paper, we propose an approach to enhancing physical HRI with a focus on dynamic robot-assisted hand-object interaction (HOI). Our methodology integrates hand pose estimation, adaptive robot control, and motion primitives to facilitate human-robot collaboration. Specifically, we employ a transformer-based algorithm to perform real-time 3D modeling of human hands from single RGB images, based on which a motion primitives model (MPM) is designed to translate human hand motions into robotic actions. The robot's action implementation is dynamically fine-tuned using the continuously updated 3D hand models. Experimental validations, including a ring-wearing task, demonstrate the system's effectiveness in adapting to real-time movements and assisting in precise task execution.
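The mapping from estimated hand motion to robot actions can be pictured as classifying keypoint displacements into a small set of primitives, each bound to a robot command. The primitive set, thresholds, and action names below are illustrative placeholders, not the paper's MPM.

```python
# Simplified sketch of a motion-primitives mapping from hand motion to robot
# actions; all names and thresholds are hypothetical.
import numpy as np

PRIMITIVE_ACTIONS = {
    "approach": "move_end_effector_forward",
    "retreat":  "move_end_effector_backward",
    "hold":     "maintain_pose",
}

def classify_primitive(prev_wrist, curr_wrist, thresh=0.01):
    """prev_wrist/curr_wrist: 3D wrist positions from the hand-pose estimator."""
    delta = np.asarray(curr_wrist) - np.asarray(prev_wrist)
    if np.linalg.norm(delta) < thresh:
        return "hold"
    return "approach" if delta[2] > 0 else "retreat"   # sign of depth motion

prev, curr = [0.0, 0.0, 0.30], [0.0, 0.0, 0.33]
primitive = classify_primitive(prev, curr)
print(primitive, "->", PRIMITIVE_ACTIONS[primitive])
```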
RLeXplore: Accelerating Research in Intrinsically-Motivated Reinforcement Learning
Yuan, Mingqi, Castanyer, Roger Creus, Li, Bo, Jin, Xin, Berseth, Glen, Zeng, Wenjun
Extrinsic rewards can effectively guide reinforcement learning (RL) agents in specific tasks. However, extrinsic rewards frequently fall short in complex environments due to the significant human effort needed for their design and annotation. This limitation underscores the necessity for intrinsic rewards, which offer auxiliary and dense signals and can enable agents to learn in an unsupervised manner. Although various intrinsic reward formulations have been proposed, their implementation and optimization details are insufficiently explored and lack standardization, thereby hindering research progress. To address this gap, we introduce RLeXplore, a unified, highly modularized, and plug-and-play framework offering reliable implementations of eight state-of-the-art intrinsic reward algorithms. Furthermore, we conduct an in-depth study that identifies critical implementation details and establishes well-justified standard practices in intrinsically-motivated RL.
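To illustrate what "plug-and-play" means in this context, the sketch below defines a shared interface that intrinsic reward modules implement so they can be swapped into an RL loop and mixed with the extrinsic reward. This is explicitly not RLeXplore's actual API, just a hypothetical minimal interface with a toy novelty signal.

```python
# Generic illustration of a plug-and-play intrinsic reward interface
# (hypothetical; not RLeXplore's API).
import numpy as np

class IntrinsicReward:
    def compute(self, obs, next_obs, action):
        raise NotImplementedError

class RandomProjectionNovelty(IntrinsicReward):
    """Toy novelty signal: distance between random projections of observations."""
    def __init__(self, obs_dim, feat_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.proj = rng.normal(size=(obs_dim, feat_dim))

    def compute(self, obs, next_obs, action):
        return float(np.linalg.norm((next_obs - obs) @ self.proj))

def total_reward(obs, next_obs, action, extrinsic, module, beta=0.01):
    """Mix the extrinsic reward with the module's intrinsic bonus."""
    return extrinsic + beta * module.compute(obs, next_obs, action)

module = RandomProjectionNovelty(obs_dim=4)
obs, next_obs = np.zeros(4), np.ones(4)
r = total_reward(obs, next_obs, action=0, extrinsic=1.0, module=module)
```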
ReGenNet: Towards Human Action-Reaction Synthesis
Xu, Liang, Zhou, Yizhou, Yan, Yichao, Jin, Xin, Zhu, Wenhan, Rao, Fengyun, Yang, Xiaokang, Zeng, Wenjun
Generative models for static scenes and objects have been widely studied, while the dynamic human action-reaction synthesis for ubiquitous causal human-human interactions, i.e., generating human reactions given the action sequence of another person as conditions, is less explored. Human-human interactions can be regarded as asymmetric, with actors and reactors in atomic interaction periods. In this paper, we comprehensively analyze the asymmetric, dynamic, synchronous, and detailed nature of human-human interactions and propose the first multi-setting human action-reaction synthesis benchmark to generate human reactions conditioned on given human actions. To begin with, we propose to annotate the actor-reactor order of the interaction sequences for the NTU120, InterHuman, and Chi3D datasets. Based on them, a diffusion-based generative model with a Transformer decoder architecture, called ReGenNet, together with an explicit distance-based interaction loss, is proposed to predict human reactions in an online manner, where the future states of actors are unavailable to reactors.
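One plausible reading of an explicit distance-based interaction loss is to supervise the predicted reactor joints through their distances to the actor's joints, so relative placement between the two people is penalized directly. The sketch below illustrates that idea; the exact pairing and weighting are assumptions, not the paper's formulation.

```python
# Minimal sketch of a distance-based interaction loss between actor and reactor
# joint positions; illustrative, not the paper's exact loss.
import torch

def interaction_loss(pred_reactor, gt_reactor, actor):
    """All tensors: (batch, frames, joints, 3) joint positions."""
    # Pairwise distances between every reactor joint and every actor joint.
    d_pred = torch.cdist(pred_reactor.flatten(0, 1), actor.flatten(0, 1))
    d_gt = torch.cdist(gt_reactor.flatten(0, 1), actor.flatten(0, 1))
    return torch.mean(torch.abs(d_pred - d_gt))

b, t, j = 2, 16, 22
pred = torch.rand(b, t, j, 3, requires_grad=True)
gt, actor = torch.rand(b, t, j, 3), torch.rand(b, t, j, 3)
loss = interaction_loss(pred, gt, actor)
loss.backward()
```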