Gu, Chenyang
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
Liu, Jiaming, Chen, Hao, An, Pengju, Liu, Zhuoyang, Zhang, Renrui, Gu, Chenyang, Li, Xiaoqi, Guo, Ziyu, Chen, Sixiang, Liu, Mengzhen, Hou, Chengkai, Zhao, Mengdi, Zhou, KC alex, Heng, Pheng-Ann, Zhang, Shanghang
Recent advancements in vision-language models (VLMs) for common-sense reasoning have led to the development of vision-language-action (VLA) models, enabling robots to perform generalized manipulation. Although existing autoregressive VLA methods leverage large-scale pretrained knowledge, they disrupt the continuity of actions. Meanwhile, some VLA methods incorporate an additional diffusion head to predict continuous actions, relying solely on VLM-extracted features, which limits their reasoning capabilities. In this paper, we introduce HybridVLA, a unified framework that seamlessly integrates the strengths of both autoregressive and diffusion policies within a single large language model, rather than simply connecting them. To bridge the generation gap, we propose a collaborative training recipe that injects diffusion modeling directly into next-token prediction. With this recipe, we find that these two forms of action prediction not only reinforce each other but also exhibit varying performance across different tasks. Therefore, we design a collaborative action ensemble mechanism that adaptively fuses these two predictions, leading to more robust control. In experiments, HybridVLA outperforms previous state-of-the-art VLA methods across various simulation and real-world tasks, including both single-arm and dual-arm robots, while demonstrating stable manipulation in previously unseen configurations.
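The abstract describes adaptively fusing a diffusion-based and an autoregressive action prediction. The following is a minimal sketch of such a confidence-weighted fusion; the function names, the 7-DoF action layout, and the weighting heuristic are illustrative assumptions, not the paper's actual ensemble mechanism.

```python
# Minimal sketch of an adaptive ensemble over two action predictions
# (diffusion-based and autoregressive). All names, confidences, and the
# weighting scheme below are illustrative assumptions.
import numpy as np

def fuse_actions(a_diffusion: np.ndarray,
                 a_autoregressive: np.ndarray,
                 conf_diffusion: float,
                 conf_autoregressive: float) -> np.ndarray:
    """Confidence-weighted average of two action predictions."""
    w = np.array([conf_diffusion, conf_autoregressive], dtype=np.float64)
    w = w / w.sum()  # normalize weights so they sum to 1
    return w[0] * a_diffusion + w[1] * a_autoregressive

# Example: fuse a continuous diffusion action with a de-tokenized
# autoregressive action (xyz translation, rpy rotation, gripper).
a_diff = np.array([0.12, -0.30, 0.45, 0.0, 0.0, 0.1, 1.0])
a_ar   = np.array([0.10, -0.28, 0.50, 0.0, 0.0, 0.0, 1.0])
fused = fuse_actions(a_diff, a_ar, conf_diffusion=0.7, conf_autoregressive=0.3)
print(fused)
```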
RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation
Wu, Kun, Hou, Chengkai, Liu, Jiaming, Che, Zhengping, Ju, Xiaozhu, Yang, Zhuqin, Li, Meng, Zhao, Yinuo, Xu, Zhiyuan, Yang, Guang, Zhao, Zhen, Li, Guangyu, Jin, Zhao, Wang, Lecheng, Mao, Jilei, Wang, Xinhua, Fan, Shichao, Liu, Ning, Ren, Pei, Zhang, Qiang, Lyu, Yaoxu, Liu, Mengzhen, He, Jingyang, Luo, Yulin, Gao, Zeyu, Li, Chenxuan, Gu, Chenyang, Fu, Yankai, Wu, Di, Wang, Xingyu, Chen, Sixiang, Wang, Zhenyu, An, Pengju, Qian, Siyuan, Zhang, Shanghang, Tang, Jian
Developing robust and general-purpose robotic manipulation policies is a key goal in the field of robotics. To achieve effective generalization, it is essential to construct comprehensive datasets that encompass a large number of demonstration trajectories and diverse tasks. Unlike vision or language data that can be collected from the Internet, robotic datasets require detailed observations and manipulation actions, necessitating significant investment in hardware-software infrastructure and human labor. While existing works have focused on assembling various individual robot datasets, a unified data collection standard is still lacking, and diversity in tasks, scenarios, and robot types remains insufficient. In this paper, we introduce RoboMIND (Multi-embodiment Intelligence Normative Data for Robot manipulation), featuring 55k real-world demonstration trajectories across 279 diverse tasks involving 61 different object classes. RoboMIND is collected through human teleoperation and encompasses comprehensive robotic-related information, including multi-view RGB-D images, proprioceptive robot state information, end effector details, and linguistic task descriptions. To ensure dataset consistency and reliability during policy learning, RoboMIND is built on a unified data collection platform and standardized protocol, covering four distinct robotic embodiments. We provide a thorough quantitative and qualitative analysis of RoboMIND across multiple dimensions, offering detailed insights into the diversity of our datasets. In our experiments, we conduct extensive real-world testing with four state-of-the-art imitation learning methods, demonstrating that training with RoboMIND data results in a high manipulation success rate and strong generalization. Our project is at https://x-humanoid-robomind.github.io/.
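To make the described per-step contents concrete, here is a minimal sketch of a hypothetical demonstration-step record covering the modalities the abstract lists (multi-view RGB-D, proprioceptive state, end-effector details, language instruction). Field names, shapes, and types are assumptions, not the actual RoboMIND schema.

```python
# Hypothetical per-step record for a teleoperated manipulation demonstration.
# All field names and shapes are illustrative assumptions.
from dataclasses import dataclass
from typing import Dict
import numpy as np

@dataclass
class DemoStep:
    rgb: Dict[str, np.ndarray]      # camera name -> HxWx3 uint8 image
    depth: Dict[str, np.ndarray]    # camera name -> HxW float32 depth map
    joint_positions: np.ndarray     # proprioceptive robot state
    end_effector_pose: np.ndarray   # e.g. xyz position + quaternion
    gripper_open: float             # end-effector (gripper) state
    instruction: str                # linguistic task description

step = DemoStep(
    rgb={"front": np.zeros((480, 640, 3), dtype=np.uint8)},
    depth={"front": np.zeros((480, 640), dtype=np.float32)},
    joint_positions=np.zeros(7),
    end_effector_pose=np.zeros(7),
    gripper_open=1.0,
    instruction="pick up the red block and place it in the bowl",
)
```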
Estimation of causal effects of multiple treatments in healthcare database studies with rare outcomes
Hu, Liangyuan, Gu, Chenyang
The preponderance of large-scale healthcare databases provides abundant opportunities for comparative effectiveness research. Evidence necessary for making informed treatment decisions often relies on comparing the effectiveness of multiple treatment options on outcomes of interest observed in a small number of individuals. Causal inference with multiple treatments and rare outcomes is a subject that has been treated sparingly in the literature. This paper designs three sets of simulations, representative of the structure of our healthcare database study, and proposes causal analysis strategies for such settings. We investigate and compare the operating characteristics of three types of methods and their variants: Bayesian Additive Regression Trees (BART), regression adjustment on multivariate spline of generalized propensity scores (RAMS), and inverse probability of treatment weighting (IPTW) with multinomial logistic regression or generalized boosted models. Our results suggest that BART and RAMS provide lower bias and mean squared error, whereas the widely used IPTW methods deliver unfavorable operating characteristics. We illustrate the methods using a case study evaluating the comparative effectiveness of robotic-assisted surgery, video-assisted thoracoscopic surgery, and open thoracotomy for treating non-small cell lung cancer.
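As a brief illustration of one of the comparator methods named above, here is a minimal sketch of IPTW with multinomial logistic regression for three treatment groups: fit generalized propensity scores P(T = t | X) and weight each subject by the inverse probability of the treatment actually received. The simulated data and the weight truncation rule are illustrative assumptions, not the paper's study design.

```python
# Minimal sketch of IPTW with multinomial logistic regression for
# multiple (three) treatments. Simulated covariates/treatments and the
# 99th-percentile weight truncation are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 1000, 5
X = rng.normal(size=(n, p))            # baseline covariates
T = rng.integers(0, 3, size=n)         # three treatment groups: 0, 1, 2

# Generalized propensity scores P(T = t | X) from a multinomial model
ps_model = LogisticRegression(max_iter=1000).fit(X, T)
gps = ps_model.predict_proba(X)        # shape (n, 3)

# IPTW weight: inverse probability of the treatment actually received
weights = 1.0 / gps[np.arange(n), T]
weights = np.clip(weights, None, np.quantile(weights, 0.99))  # truncate extremes

# The weights would then feed a weighted outcome model or weighted
# estimator of pairwise treatment effects.
print(weights[:5])
```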