Duan, Peiqi
On the Evaluation of Generative Robotic Simulations
Chen, Feng, Xu, Botian, Hua, Pu, Duan, Peiqi, Yang, Yanchao, Ma, Yi, Xu, Huazhe
Due to the difficulty of acquiring extensive real-world data, robot simulation has become crucial for parallel training and sim-to-real transfer, highlighting the importance of scalable simulated robotic tasks. Foundation models have demonstrated impressive capabilities in autonomously generating feasible robotic tasks. However, this new paradigm underscores the challenge of adequately evaluating these autonomously generated tasks. To address this, we propose a comprehensive evaluation framework tailored to generative simulations. For single-task quality, we evaluate the realism of the generated task and the completeness of the generated trajectories using large language models and vision-language models. For diversity, we measure both task and data diversity through the text similarity of task descriptions and the loss of a world model trained on collected task trajectories. For task-level generalization, we assess the zero-shot generalization of a policy, trained on multiple generated tasks, to unseen tasks. Experiments on three representative task generation pipelines demonstrate that the results from our framework are highly consistent with human evaluations, confirming the feasibility and validity of our approach. The findings reveal that while certain methods achieve strong quality or diversity, no single approach excels across all metrics, suggesting a need for greater focus on balancing these objectives. Additionally, our analysis highlights the low generalization capability that remains a common challenge for current works.
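A minimal sketch of the kind of task-diversity metric the abstract mentions, scoring how dissimilar the generated task descriptions are from one another. The TF-IDF representation and the "one minus mean pairwise cosine similarity" aggregation are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch only: one plausible way to score task-level diversity
# from the text similarity of task descriptions. The vectorizer and the
# aggregation rule are assumptions for illustration.
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def task_diversity(task_descriptions: list[str]) -> float:
    """Return a diversity score in [0, 1]; higher means more diverse tasks."""
    vectors = TfidfVectorizer().fit_transform(task_descriptions)
    sims = cosine_similarity(vectors)
    # Average similarity over distinct task pairs, then invert it.
    pair_sims = [sims[i, j] for i, j in combinations(range(len(task_descriptions)), 2)]
    return 1.0 - sum(pair_sims) / len(pair_sims)


if __name__ == "__main__":
    tasks = [
        "Pick up the red block and place it on the shelf",
        "Open the drawer and retrieve the scissors",
        "Push the mug to the left edge of the table",
    ]
    print(f"Task diversity: {task_diversity(tasks):.3f}")
```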
EvPlug: Learn a Plug-and-Play Module for Event and Image Fusion
Jiang, Jianping, Zhou, Xinyu, Duan, Peiqi, Shi, Boxin
Event cameras and RGB cameras exhibit complementary characteristics in imaging: the former possesses high dynamic range (HDR) and high temporal resolution, while the latter provides rich texture and color information. This makes the integration of event cameras into middle- and high-level RGB-based vision tasks highly promising. However, challenges arise in multi-modal fusion, data annotation, and model architecture design. In this paper, we propose EvPlug, which learns a plug-and-play event and image fusion module under the supervision of an existing RGB-based model. The learned fusion module integrates event streams with image features in the form of a plug-in, making the RGB-based model robust to HDR and fast-motion scenes while enabling high-temporal-resolution inference. Our method requires only unlabeled event-image pairs (no pixel-wise alignment required) and does not alter the structure or weights of the RGB-based model. We demonstrate the superiority of EvPlug in several vision tasks such as object detection, semantic segmentation, and 3D hand pose estimation.
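A conceptual sketch of a plug-in style fusion module in the spirit of the description above: event features are encoded and merged into the features of a frozen RGB model as a residual correction, leaving the RGB model's structure and weights untouched. The module name, channel sizes, and residual fusion scheme are assumptions for illustration, not the authors' architecture.

```python
# Conceptual sketch only: a minimal plug-in event/image fusion module.
# Names, channel sizes, and the residual fusion scheme are assumptions.
import torch
import torch.nn as nn


class EventImageFusionPlug(nn.Module):
    """Fuses event-stream features into frozen RGB features as a residual plug-in."""

    def __init__(self, event_channels: int = 5, feature_channels: int = 64):
        super().__init__()
        # Encode the raw event representation (e.g., a voxel grid) into feature space.
        self.event_encoder = nn.Sequential(
            nn.Conv2d(event_channels, feature_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feature_channels, feature_channels, kernel_size=3, padding=1),
        )
        # Predict a correction for the RGB features from both modalities.
        self.fuse = nn.Conv2d(2 * feature_channels, feature_channels, kernel_size=1)

    def forward(self, rgb_features: torch.Tensor, event_voxels: torch.Tensor) -> torch.Tensor:
        event_features = self.event_encoder(event_voxels)
        fused = self.fuse(torch.cat([rgb_features, event_features], dim=1))
        # Residual form keeps the frozen RGB model's features as the default path.
        return rgb_features + fused


if __name__ == "__main__":
    plug = EventImageFusionPlug()
    rgb_feat = torch.randn(1, 64, 32, 32)  # features from a frozen RGB backbone
    events = torch.randn(1, 5, 32, 32)     # event voxel grid matched to the feature map
    print(plug(rgb_feat, events).shape)    # torch.Size([1, 64, 32, 32])
```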