Not enough data to create a plot.
Try a different view from the menu above.
An, Pengju
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
Liu, Jiaming, Chen, Hao, An, Pengju, Liu, Zhuoyang, Zhang, Renrui, Gu, Chenyang, Li, Xiaoqi, Guo, Ziyu, Chen, Sixiang, Liu, Mengzhen, Hou, Chengkai, Zhao, Mengdi, Zhou, KC alex, Heng, Pheng-Ann, Zhang, Shanghang
Recent advancements in vision-language models (VLMs) for common-sense reasoning have led to the development of vision-language-action (VLA) models, enabling robots to perform generalized manipulation. Although existing autoregressive VLA methods leverage large-scale pretrained knowledge, they disrupt the continuity of actions. Meanwhile, some VLA methods incorporate an additional diffusion head to predict continuous actions, relying solely on VLM-extracted features, which limits their reasoning capabilities. In this paper, we introduce HybridVLA, a unified framework that seamlessly integrates the strengths of both autoregressive and diffusion policies within a single large language model, rather than simply connecting them. To bridge the generation gap, a collaborative training recipe is proposed that injects the diffusion modeling directly into the next-token prediction. With this recipe, we find that these two forms of action prediction not only reinforce each other but also exhibit varying performance across different tasks. Therefore, we design a collaborative action ensemble mechanism that adaptively fuses these two predictions, leading to more robust control. In experiments, HybridVLA outperforms previous state-of-the-art VLA methods across various simulation and real-world tasks, including both single-arm and dual-arm robots, while demonstrating stable manipulation in previously unseen configurations.
RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete
Ji, Yuheng, Tan, Huajie, Shi, Jiayu, Hao, Xiaoshuai, Zhang, Yuan, Zhang, Hengyuan, Wang, Pengwei, Zhao, Mengdi, Mu, Yao, An, Pengju, Xue, Xinda, Su, Qinghang, Lyu, Huaihai, Zheng, Xiaolong, Liu, Jiaming, Wang, Zhongyuan, Zhang, Shanghang
Recent advancements in Multimodal Large Language Models (MLLMs) have shown remarkable capabilities across various multimodal contexts. However, their application in robotic scenarios, particularly for long-horizon manipulation tasks, reveals significant limitations. These limitations arise from the current MLLMs lacking three essential robotic brain capabilities: Planning Capability, which involves decomposing complex manipulation instructions into manageable sub-tasks; Affordance Perception, the ability to recognize and interpret the affordances of interactive objects; and Trajectory Prediction, the foresight to anticipate the complete manipulation trajectory necessary for successful execution. To enhance the robotic brain's core capabilities from abstract to concrete, we introduce ShareRobot, a high-quality heterogeneous dataset that labels multi-dimensional information such as task planning, object affordance, and end-effector trajectory. ShareRobot's diversity and accuracy have been meticulously refined by three human annotators. Building on this dataset, we developed RoboBrain, an MLLM-based model that combines robotic and general multi-modal data, utilizes a multi-stage training strategy, and incorporates long videos and high-resolution images to improve its robotic manipulation capabilities. Extensive experiments demonstrate that RoboBrain achieves state-of-the-art performance across various robotic tasks, highlighting its potential to advance robotic brain capabilities.
RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation
Wu, Kun, Hou, Chengkai, Liu, Jiaming, Che, Zhengping, Ju, Xiaozhu, Yang, Zhuqin, Li, Meng, Zhao, Yinuo, Xu, Zhiyuan, Yang, Guang, Zhao, Zhen, Li, Guangyu, Jin, Zhao, Wang, Lecheng, Mao, Jilei, Wang, Xinhua, Fan, Shichao, Liu, Ning, Ren, Pei, Zhang, Qiang, Lyu, Yaoxu, Liu, Mengzhen, He, Jingyang, Luo, Yulin, Gao, Zeyu, Li, Chenxuan, Gu, Chenyang, Fu, Yankai, Wu, Di, Wang, Xingyu, Chen, Sixiang, Wang, Zhenyu, An, Pengju, Qian, Siyuan, Zhang, Shanghang, Tang, Jian
Developing robust and general-purpose robotic manipulation policies is a key goal in the field of robotics. To achieve effective generalization, it is essential to construct comprehensive datasets that encompass a large number of demonstration trajectories and diverse tasks. Unlike vision or language data that can be collected from the Internet, robotic datasets require detailed observations and manipulation actions, necessitating significant investment in hardware-software infrastructure and human labor. While existing works have focused on assembling various individual robot datasets, there remains a lack of a unified data collection standard and insufficient diversity in tasks, scenarios, and robot types. In this paper, we introduce RoboMIND (Multi-embodiment Intelligence Normative Data for Robot manipulation), featuring 55k real-world demonstration trajectories across 279 diverse tasks involving 61 different object classes. RoboMIND is collected through human teleoperation and encompasses comprehensive robotic-related information, including multi-view RGB-D images, proprioceptive robot state information, end effector details, and linguistic task descriptions. To ensure dataset consistency and reliability during policy learning, RoboMIND is built on a unified data collection platform and standardized protocol, covering four distinct robotic embodiments. We provide a thorough quantitative and qualitative analysis of RoboMIND across multiple dimensions, offering detailed insights into the diversity of our datasets. In our experiments, we conduct extensive real-world testing with four state-of-the-art imitation learning methods, demonstrating that training with RoboMIND data results in a high manipulation success rate and strong generalization. Our project is at https://x-humanoid-robomind.github.io/.
$\textit{S}^3$Gaussian: Self-Supervised Street Gaussians for Autonomous Driving
Huang, Nan, Wei, Xiaobao, Zheng, Wenzhao, An, Pengju, Lu, Ming, Zhan, Wei, Tomizuka, Masayoshi, Keutzer, Kurt, Zhang, Shanghang
Photorealistic 3D reconstruction of street scenes is a critical technique for developing real-world simulators for autonomous driving. Despite the efficacy of Neural Radiance Fields (NeRF) for driving scenes, 3D Gaussian Splatting (3DGS) emerges as a promising direction due to its faster speed and more explicit representation. However, most existing street 3DGS methods require tracked 3D vehicle bounding boxes to decompose the static and dynamic elements for effective reconstruction, limiting their applications for in-the-wild scenarios. To facilitate efficient 3D scene reconstruction without costly annotations, we propose a self-supervised street Gaussian ($\textit{S}^3$Gaussian) method to decompose dynamic and static elements from 4D consistency. We represent each scene with 3D Gaussians to preserve the explicitness and further accompany them with a spatial-temporal field network to compactly model the 4D dynamics. We conduct extensive experiments on the challenging Waymo-Open dataset to evaluate the effectiveness of our method. Our $\textit{S}^3$Gaussian demonstrates the ability to decompose static and dynamic scenes and achieves the best performance without using 3D annotations. Code is available at: https://github.com/nnanhuang/S3Gaussian/.