Collaborating Authors

 Yang, Xiaokang


MM-ACT: Learn from Multimodal Parallel Generation to Act

Liang, Haotian, Chen, Xinyi, Wang, Bin, Chen, Mingkang, Liu, Yitian, Zhang, Yuhao, Chen, Zanxin, Yang, Tianshuo, Chen, Yilun, Pang, Jiangmiao, Liu, Dong, Yang, Xiaokang, Mu, Yao, Shao, Wenqi, Luo, Ping

arXiv.org Artificial Intelligence

A generalist robotic policy needs both semantic understanding for task planning and the ability to interact with the environment through predictive capabilities. To tackle this, we present MM-ACT, a unified Vision-Language-Action (VLA) model that integrates text, image, and action in a shared token space and performs generation across all three modalities. MM-ACT adopts a re-mask parallel decoding strategy for text and image generation, and employs a one-step parallel decoding strategy for action generation to improve efficiency. We introduce Context-Shared Multimodal Learning, a unified training paradigm that supervises generation in all three modalities from a shared context, enhancing action generation through cross-modal learning. Experiments were conducted on the LIBERO simulation benchmark and a real Franka robot setup, as well as on RoboTwin 2.0, to assess in-domain and out-of-domain performance, respectively. Our approach achieves a success rate of 96.3% on LIBERO, 72.0% across three tasks on the real Franka, and 52.38% across eight bimanual tasks of RoboTwin 2.0, with an additional gain of 9.25% from cross-modal learning. We release our code, models, and data at https://github.com/HHYHRHY/MM-ACT.
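
The contrast between the two decoding regimes can be sketched with a toy masked decoder. Everything here is invented for illustration (the `toy_logits` function stands in for the shared transformer, and the unmasking schedule is a generic linear one, not necessarily MM-ACT's): the action branch fills all masked positions in one model call, while the text/image branch commits only the most confident tokens per step and keeps the rest masked.

```python
import numpy as np

MASK, VOCAB, SEQ = -1, 16, 8

def toy_logits(tokens):
    # Stand-in for the shared transformer: deterministic pseudo-logits
    # for every position over the token vocabulary.
    return np.sin(np.outer(tokens + 2, np.arange(1, VOCAB + 1)))

def one_step_decode(seq):
    """Action branch: fill every masked position in a single forward pass."""
    out = seq.copy()
    mask = out == MASK
    out[mask] = toy_logits(out)[mask].argmax(axis=-1)
    return out, 1  # exactly one model call

def remask_decode(seq, steps=4):
    """Text/image branch: commit the most confident tokens each step and
    keep the rest masked, over a fixed number of refinement steps."""
    out, calls = seq.copy(), 0
    for s in range(steps):
        mask = out == MASK
        if not mask.any():
            break
        logits = toy_logits(out)
        calls += 1
        conf = logits.max(axis=-1)
        conf[~mask] = -np.inf                      # never re-commit fixed tokens
        k = max(1, int(mask.sum() / (steps - s)))  # linear unmasking schedule
        commit = np.argsort(conf)[-k:]
        out[commit] = logits[commit].argmax(axis=-1)
    return out, calls

actions, action_calls = one_step_decode(np.full(SEQ, MASK))
text, text_calls = remask_decode(np.full(SEQ, MASK))
```

The efficiency claim falls out of the call counts: one forward pass for the action chunk versus one per refinement step for text and image.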


Goal-Driven Reward by Video Diffusion Models for Reinforcement Learning

Wang, Qi, Wu, Mian, Zhang, Yuyang, Yuan, Mingqi, Zhang, Wenyao, You, Haoxiang, Wang, Yunbo, Jin, Xin, Yang, Xiaokang, Zeng, Wenjun

arXiv.org Artificial Intelligence

Reinforcement Learning (RL) has achieved remarkable success in various domains, yet it often relies on carefully designed programmatic reward functions to guide agent behavior. Designing such reward functions can be challenging and may not generalize well across different tasks. To address this limitation, we leverage the rich world knowledge contained in pretrained video diffusion models to provide goal-driven reward signals for RL agents without ad-hoc reward design. Our key idea is to exploit off-the-shelf video diffusion models pretrained on large-scale video datasets as informative reward functions in terms of video-level and frame-level goals. For video-level rewards, we first finetune a pretrained video diffusion model on domain-specific datasets and then employ its video encoder to evaluate the alignment between the latent representations of the agent's trajectories and the generated goal videos. To enable more fine-grained goal achievement, we derive a frame-level goal by identifying the most relevant frame from the generated video using CLIP, which serves as the goal state. We then employ a learned forward-backward representation that represents the probability of visiting the goal state from a given state-action pair as a frame-level reward, promoting more coherent and goal-driven trajectories. Experiments on various Meta-World tasks demonstrate the effectiveness of our approach.
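
The two reward signals can be sketched with toy encoders. Note the hedging: the paper uses a finetuned video diffusion model's encoder and CLIP, whereas the random projections `W_video` and `W_clip` below are stand-ins, so this only illustrates the alignment-scoring and frame-selection mechanics.

```python
import numpy as np

rng = np.random.default_rng(42)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# Random projections stand in for the finetuned video encoder and CLIP;
# treat everything here as a toy, not the paper's learned models.
W_video = rng.normal(size=(32, 8))
W_clip = rng.normal(size=(32, 8))

def video_reward(traj_frames, goal_frames):
    """Video-level reward: alignment of pooled latent representations
    of the agent's trajectory and the generated goal video."""
    z_traj = (traj_frames @ W_video).mean(axis=0)
    z_goal = (goal_frames @ W_video).mean(axis=0)
    return cosine(z_traj, z_goal)

def pick_goal_frame(goal_frames, text_emb):
    """Frame-level goal: the generated frame most similar to the task text."""
    sims = [cosine(f @ W_clip, text_emb) for f in goal_frames]
    return int(np.argmax(sims))

goal_video = rng.normal(size=(10, 32))               # generated goal video
near = goal_video + 0.1 * rng.normal(size=(10, 32))  # trajectory near the goal
text_emb = goal_video[3] @ W_clip                    # pretend the text matches frame 3
```

A trajectory whose frames track the goal video scores a high video-level reward, and the selected frame index would then seed the forward-backward frame-level reward described in the abstract.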


Coordinated Humanoid Robot Locomotion with Symmetry Equivariant Reinforcement Learning Policy

Nie, Buqing, Zhang, Yang, Jin, Rongjun, Cao, Zhanxiang, Lin, Huangxuan, Yang, Xiaokang, Gao, Yue

arXiv.org Artificial Intelligence

The human nervous system exhibits bilateral symmetry, enabling coordinated and balanced movements. However, existing Deep Reinforcement Learning (DRL) methods for humanoid robots neglect the robot's morphological symmetry, leading to uncoordinated and suboptimal behaviors. Inspired by human motor control, we propose Symmetry Equivariant Policy (SE-Policy), a new DRL framework that embeds strict symmetry equivariance in the actor and symmetry invariance in the critic without additional hyperparameters. SE-Policy enforces consistent behaviors across symmetric observations, producing temporally and spatially coordinated motions with higher task performance. Extensive experiments on velocity tracking tasks, conducted in both simulation and real-world deployment with the Unitree G1 humanoid robot, demonstrate that SE-Policy improves tracking accuracy by up to 40% compared to state-of-the-art baselines, while achieving superior spatial-temporal coordination. These results demonstrate the effectiveness of SE-Policy and its broad applicability to humanoid robots.
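
What "strict symmetry equivariance in the actor" means can be shown with a minimal sketch. This uses symmetrization (averaging the policy with its mirrored counterpart), which is one generic way to obtain equivariance, not necessarily SE-Policy's architectural construction, and the toy mirror operators simply reverse vectors rather than encoding a real robot's morphology:

```python
import numpy as np

rng = np.random.default_rng(0)
OBS, ACT = 6, 4

# Toy mirror operators that swap left/right components; on a real robot
# they are determined by its morphology, not by simple reversal.
M_s = np.eye(OBS)[::-1].copy()   # mirrors an observation
M_a = np.eye(ACT)[::-1].copy()   # mirrors an action

W = rng.normal(size=(ACT, OBS))  # arbitrary, non-equivariant base policy

def base_policy(s):
    return np.tanh(W @ s)

def se_policy(s):
    """Symmetrize the base policy: average its output with the mirrored
    output on the mirrored observation. Because the mirrors are
    involutions (M @ M = I), the result satisfies
    se_policy(M_s @ s) == M_a @ se_policy(s) by construction."""
    return 0.5 * (base_policy(s) + M_a @ base_policy(M_s @ s))

s = rng.normal(size=OBS)
```

Equivariance guarantees that a mirrored observation yields exactly the mirrored action, which is the mechanism behind the coordinated, balanced gaits the abstract reports.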


Keep on Going: Learning Robust Humanoid Motion Skills via Selective Adversarial Training

Zhang, Yang, Cao, Zhanxiang, Nie, Buqing, Li, Haoyang, Jiangwei, Zhong, Sun, Qiao, Hu, Xiaoyi, Yang, Xiaokang, Gao, Yue

arXiv.org Artificial Intelligence

Humanoid robots are expected to operate reliably over long horizons while executing versatile whole-body skills. Yet Reinforcement Learning (RL) motion policies typically lose stability under prolonged operation, sensor/actuator noise, and real-world disturbances. In this work, we propose a Selective Adversarial Attack for Robust Training (SA2RT) to enhance the robustness of motion skills. The adversary learns to identify and sparsely perturb the most vulnerable states and actions under an attack-budget constraint, thereby exposing true weaknesses without inducing conservative overfitting. The resulting non-zero-sum, alternating optimization continually strengthens the motion policy against the strongest discovered attacks. We validate our approach on the Unitree G1 humanoid robot across perceptive locomotion and whole-body control tasks. Experimental results show that adversarially trained policies improve the terrain traversal success rate by 40%, reduce the trajectory tracking error by 32%, and maintain long-horizon mobility and tracking performance. Together, these results demonstrate that selective adversarial attacks are an effective driver for learning robust, long-horizon humanoid motion skills.
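
The core idea of a budget-constrained, sparse attack can be sketched without any robot. In this toy, the adversary estimates per-dimension sensitivity by finite differences (the paper's adversary is learned), perturbs only the k most vulnerable dimensions, and stays within an eps budget per dimension; `value` is a made-up critic the adversary tries to drive down:

```python
import numpy as np

rng = np.random.default_rng(1)
OBS = 8

w = rng.normal(size=OBS)

def value(s):
    """Toy critic: the adversary wants to minimize this."""
    return float(np.tanh(w @ s))

def sparse_attack(s, k=2, eps=0.1):
    """Perturb only the k most sensitive observation dimensions
    (sensitivity estimated here by finite differences), each within an
    eps budget -- a toy stand-in for SA2RT's learned adversary."""
    grad = np.array([
        (value(s + e) - value(s - e)) / 2e-4
        for e in 1e-4 * np.eye(OBS)
    ])
    idx = np.argsort(np.abs(grad))[-k:]        # most vulnerable dims
    delta = np.zeros(OBS)
    delta[idx] = -eps * np.sign(grad[idx])     # push the value downhill
    return s + delta, idx

s = rng.normal(size=OBS)
s_adv, idx = sparse_attack(s)
```

Sparsity is what distinguishes this from a blanket noise injection: only the dimensions where the policy is genuinely fragile get attacked, which is the abstract's argument for avoiding conservative overfitting.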


FieldGen: From Teleoperated Pre-Manipulation Trajectories to Field-Guided Data Generation

Wang, Wenhao, Ye, Kehe, Zhou, Xinyu, Chen, Tianxing, Min, Cao, Zhu, Qiaoming, Yang, Xiaokang, Luo, Ping, Shen, Yongjian, Yang, Yang, Yao, Maoqing, Mu, Yao

arXiv.org Artificial Intelligence

Large-scale and diverse datasets are vital for training robust robotic manipulation policies, yet existing data collection methods struggle to balance scale, diversity, and quality. Simulation offers scalability but suffers from sim-to-real gaps, while teleoperation yields high-quality demonstrations with limited diversity and high labor cost. We introduce FieldGen, a field-guided data generation framework that enables scalable, diverse, and high-quality real-world data collection with minimal human supervision. FieldGen decomposes manipulation into two stages: a pre-manipulation phase, allowing trajectory diversity, and a fine manipulation phase requiring expert precision. Human demonstrations capture key contact and pose information, after which an attraction field automatically generates diverse trajectories converging to successful configurations. This decoupled design combines scalable trajectory diversity with precise supervision. Moreover, FieldGen-Reward augments generated data with reward annotations to further enhance policy learning. Experiments demonstrate that policies trained with FieldGen achieve higher success rates and improved stability compared to teleoperation-based baselines, while significantly reducing human effort in long-term real-world data collection. The project webpage is available at https://fieldgen.github.io/.
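
The attraction-field mechanism can be illustrated with a simple first-order field. The update rule, gains, and 3-D point representation below are all invented for illustration (FieldGen's field is built from demonstrated contact and pose information): diverse start poses are sampled, and every rollout is pulled toward the demonstrated pre-manipulation pose.

```python
import numpy as np

rng = np.random.default_rng(7)

def attract_rollout(start, goal, gain=0.2, noise=0.02, steps=60):
    """Roll a point toward `goal` under a simple attraction field,
    x <- x + gain * (goal - x) + noise -- a toy stand-in for
    FieldGen's field-guided trajectory generator."""
    x, traj = start.copy(), [start.copy()]
    for _ in range(steps):
        x = x + gain * (goal - x) + noise * rng.normal(size=x.shape)
        traj.append(x.copy())
    return np.array(traj)

goal = np.array([0.5, 0.2, 0.3])           # demonstrated pre-manipulation pose (made up)
starts = rng.uniform(-1, 1, size=(5, 3))   # diverse sampled start poses
trajs = [attract_rollout(s, goal) for s in starts]
ends = np.array([t[-1] for t in trajs])
```

Each trajectory is different (different start, different noise), yet all of them converge near the demonstrated pose, which is the decoupling the abstract describes: diversity in the pre-manipulation phase, precision at the handoff to fine manipulation.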


DAWP: A framework for global observation forecasting via Data Assimilation and Weather Prediction in satellite observation space

Gong, Junchao, Xu, Jingyi, Fei, Ben, Ling, Fenghua, Zhang, Wenlong, Chen, Kun, Xu, Wanghan, Yang, Weidong, Yang, Xiaokang, Bai, Lei

arXiv.org Artificial Intelligence

Weather prediction is a critical task for human society, where impressive progress has been made by training artificial intelligence weather prediction (AIWP) methods with reanalysis data. However, reliance on reanalysis data saddles AIWPs with shortcomings, including data assimilation biases and temporal discrepancies. To liberate AIWPs from reanalysis data, observation forecasting emerges as a transformative paradigm for weather prediction. One of the key challenges in observation forecasting is learning spatiotemporal dynamics across disparate measurement systems with irregular high-resolution observation data, which constrains the design and prediction of AIWPs. To this end, we propose DAWP, an innovative framework that enables AIWPs to operate in a complete observation space by initialization with an artificial intelligence data assimilation (AIDA) module. Specifically, our AIDA module applies a mask multi-modality autoencoder (MMAE) for assimilating irregular satellite observation tokens encoded by mask ViT-VAEs. For AIWP, we introduce a spatiotemporal decoupling transformer with cross-regional boundary conditioning (CBC), learning the dynamics in observation space, to enable sub-image-based global observation forecasting. Comprehensive experiments demonstrate that AIDA initialization significantly improves the rollout performance and efficiency of AIWP. Additionally, we show that DAWP holds promising potential for application to global precipitation forecasting.
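
The assimilation step (turning irregular, gappy observations into a complete field the forecaster can be initialized from) can be illustrated with a deliberately simple 1-D toy. Here linear interpolation fills the unobserved points of a synthetic smooth field; the real AIDA module instead reconstructs masked satellite-observation tokens with a learned masked multi-modality autoencoder, so this only shows the problem shape, not the method.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
x = np.linspace(0, 4 * np.pi, N)
field = np.sin(x) + 0.3 * np.cos(3 * x)        # toy "true" atmospheric field

# Irregular observations: only ~30% of grid points are observed,
# mimicking sparse satellite coverage.
mask = rng.random(N) < 0.3
mask[0] = mask[-1] = True                       # anchor the domain endpoints

# "Assimilate": produce a complete analysis field from partial observations.
analysis = np.interp(x, x[mask], field[mask])

rmse = float(np.sqrt(np.mean((analysis - field) ** 2)))
```

The analysis agrees exactly with the field at observed points and stays close elsewhere; a forecaster initialized from such a complete field is what "operating in a complete observation space" refers to.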


Expertise need not monopolize: Action-Specialized Mixture of Experts for Vision-Language-Action Learning

Shen, Weijie, Liu, Yitian, Wu, Yuhao, Liang, Zhixuan, Gu, Sijia, Wang, Dehui, Nian, Tian, Xu, Lei, Qin, Yusen, Pang, Jiangmiao, Guan, Xinping, Yang, Xiaokang, Mu, Yao

arXiv.org Artificial Intelligence

Vision-Language-Action (VLA) models are experiencing rapid development and demonstrating promising capabilities in robotic manipulation tasks. However, scaling up VLA models presents several critical challenges: (1) Training new VLA models from scratch demands substantial computational resources and extensive datasets. Given the current scarcity of robot data, it becomes particularly valuable to fully leverage well-pretrained VLA model weights during the scaling process. (2) Real-time control requires carefully balancing model capacity with computational efficiency. To address these challenges, we propose AdaMoE, a Mixture-of-Experts (MoE) architecture that inherits pretrained weights from dense VLA models and scales up the action expert by replacing its feedforward layers with sparsely activated MoE layers. AdaMoE employs a decoupling technique that separates expert selection from expert weighting through an independent scale adapter working alongside the traditional router. This enables experts to be selected based on task relevance while contributing with independently controlled weights, allowing collaborative expert utilization rather than winner-takes-all dynamics. Our approach demonstrates that expertise need not monopolize. Instead, through collaborative expert utilization, we can achieve superior performance while maintaining computational efficiency. AdaMoE consistently outperforms the baseline model across key benchmarks, delivering performance gains of 1.8% on LIBERO and 9.3% on RoboTwin. Most importantly, a substantial 21.5% improvement in real-world experiments validates its practical effectiveness for robotic manipulation tasks.
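
The decoupling of selection from weighting can be sketched in a few lines. This is a toy re-implementation of the idea, not AdaMoE's code: the router only decides *which* top-k experts fire, while a separate sigmoid scale adapter decides *how much* each selected expert contributes, so the weights need not sum to one the way a softmax over router logits would force.

```python
import numpy as np

rng = np.random.default_rng(0)
D, E, K = 16, 8, 2   # hidden dim, number of experts, top-k

W_router = rng.normal(size=(E, D))   # selects which experts fire
W_scale = rng.normal(size=(E, D))    # independently weights them
experts = [rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(E)]

def adamoe_layer(x):
    """Decoupled MoE layer: top-k selection from the router, mixing
    weights from a separate scale adapter (toy sketch of the idea)."""
    sel_logits = W_router @ x
    topk = np.argsort(sel_logits)[-K:]           # selection: task relevance
    scales = 1 / (1 + np.exp(-(W_scale @ x)))    # weighting: independent sigmoid
    out = np.zeros(D)
    for e in topk:
        out += scales[e] * (experts[e] @ x)      # collaborative contribution
    return out, topk, scales[topk]

x = rng.normal(size=D)
y, topk, wts = adamoe_layer(x)
```

In a standard MoE, softmaxing the router logits over the selected experts couples the two decisions; splitting them is what the abstract means by avoiding winner-takes-all dynamics.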


ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration

Yan, Xianglong, Li, Zhiteng, Zhang, Tianao, Qin, Haotong, Kong, Linghe, Zhang, Yulun, Yang, Xiaokang

arXiv.org Artificial Intelligence

Large language models (LLMs) have demonstrated remarkable performance, but their long-context reasoning remains constrained by the excessive memory required for the Key-Value (KV) cache. This makes KV cache compression a critical step toward efficient long-context inference. Recent methods have explored low-rank techniques to reduce the hidden size of the KV cache. However, they neglect the distinct roles and varying importance of Keys and Values, leading to significant performance drops under high compression. To address this, we propose ReCalKV, a post-training low-rank KV cache compression approach with tailored strategies for Keys and Values. For Keys, we propose Head-wise Similarity aware Reordering (HSR), which clusters structurally similar heads into groups, enabling more accurate low-rank approximation via grouped SVD. For Values, we propose Offline Value Calibration (OVC), which efficiently calibrates the value projection matrix using calibration data without training, ensuring an accurate representation of contextual information. Extensive experiments show that ReCalKV consistently outperforms existing low-rank compression methods, achieving high compression ratios with minimal performance loss. The code and models will be available at: https://github.com/XIANGLONGYAN/ReCalKV.
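
Why reordering similar heads before grouped SVD helps can be demonstrated numerically. The toy below fabricates four low-rank "key matrices" where heads 2 and 3 are noisy copies of heads 0 and 1, pairs heads greedily by cosine similarity (a simplified stand-in for HSR's clustering), and compares the rank-R truncation error of similarity-ordered groups against the original order:

```python
import numpy as np

rng = np.random.default_rng(5)
T, D, R = 32, 8, 2   # tokens, head dim, retained rank per group

# Four heads' key matrices; heads 2 and 3 are noisy copies of 0 and 1,
# so the "right" grouping is {0, 2} and {1, 3}.
A = rng.normal(size=(T, 2)) @ rng.normal(size=(2, D))
B = rng.normal(size=(T, 2)) @ rng.normal(size=(2, D))
heads = [A, B,
         A + 0.05 * rng.normal(size=(T, D)),
         B + 0.05 * rng.normal(size=(T, D))]

def similarity_order():
    """Greedy head reordering: pair each head with its most similar
    unused head (cosine similarity of flattened key matrices)."""
    flat = np.array([h.ravel() for h in heads])
    flat /= np.linalg.norm(flat, axis=1, keepdims=True)
    sim = flat @ flat.T
    order, used = [], [False] * len(heads)
    for i in range(len(heads)):
        if used[i]:
            continue
        used[i] = True
        j = max((k for k in range(len(heads)) if not used[k]),
                key=lambda k: sim[i, k])
        used[j] = True
        order += [i, j]
    return order

def grouped_rank_r_error(order, group_size=2):
    """Concatenate heads per group, truncate to rank R via SVD, and
    return the total squared reconstruction error (by Eckart-Young,
    the energy in the discarded singular values)."""
    err = 0.0
    for g in range(0, len(order), group_size):
        M = np.hstack([heads[i] for i in order[g:g + group_size]])
        s = np.linalg.svd(M, compute_uv=False)
        err += float(np.sum(s[R:] ** 2))
    return err

reordered = grouped_rank_r_error(similarity_order())
original = grouped_rank_r_error([0, 1, 2, 3])
```

Grouping near-duplicate heads makes each concatenated block close to rank 2, so almost nothing is lost at truncation, whereas the original order mixes dissimilar heads into effectively rank-4 blocks and discards real signal.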


RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation

Chen, Tianxing, Chen, Zanxin, Chen, Baijun, Cai, Zijian, Liu, Yibin, Li, Zixuan, Liang, Qiwei, Lin, Xianliang, Ge, Yiheng, Gu, Zhenyu, Deng, Weiliang, Guo, Yubin, Nian, Tian, Xie, Xuanbing, Chen, Qiangyu, Su, Kailun, Xu, Tianling, Liu, Guodong, Hu, Mengkang, Gao, Huan-ang, Wang, Kaixuan, Liang, Zhixuan, Qin, Yusen, Yang, Xiaokang, Luo, Ping, Mu, Yao

arXiv.org Artificial Intelligence

Simulation-based data synthesis has emerged as a powerful paradigm for advancing real-world robotic manipulation. Yet existing datasets remain insufficient for robust bimanual manipulation due to (1) the lack of scalable task generation methods and (2) oversimplified simulation environments. We present RoboTwin 2.0, a scalable framework for automated, large-scale generation of diverse and realistic data, together with unified evaluation protocols for dual-arm manipulation. At its core is RoboTwin-OD, an object library of 731 instances across 147 categories with semantic and manipulation-relevant annotations. Building on this, we design an expert data synthesis pipeline that leverages multimodal language models (MLLMs) and simulation-in-the-loop refinement to automatically generate task-level execution code. To improve sim-to-real transfer, RoboTwin 2.0 applies structured domain randomization along five axes: clutter, lighting, background, tabletop height, and language, enhancing data diversity and policy robustness. The framework is instantiated across 50 dual-arm tasks and five robot embodiments. Empirically, it yields a 10.9% gain in code generation success rate. For downstream policy learning, a VLA model trained with synthetic data plus only 10 real demonstrations achieves a 367% relative improvement over the 10-demo baseline, while zero-shot models trained solely on synthetic data obtain a 228% gain. These results highlight the effectiveness of RoboTwin 2.0 in strengthening sim-to-real transfer and robustness to environmental variations. We release the data generator, benchmark, dataset, and code to support scalable research in robust bimanual manipulation. Project Page: https://robotwin-platform.github.io/, Code: https://github.com/robotwin-Platform/robotwin/.
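
Structured domain randomization along the five named axes amounts to sampling one value per axis for every generated episode. The axis values below are invented placeholders (the real asset, lighting, and instruction sets live in RoboTwin-OD and the generation pipeline); the sketch only shows the sampling structure:

```python
import random

# Invented placeholder values for the five randomization axes the
# abstract names: clutter, lighting, background, tabletop height, language.
AXES = {
    "clutter": ["none", "low", "high"],
    "lighting": ["dim", "neutral", "bright", "colored"],
    "background": ["plain", "textured", "cluttered"],
    "tabletop_height_cm": [70, 75, 80, 85, 90],
    "language": ["templated", "paraphrased", "verbose"],
}

def sample_scene(rng):
    """Draw one randomized scene configuration, one value per axis."""
    return {axis: rng.choice(values) for axis, values in AXES.items()}

rng = random.Random(0)
scenes = [sample_scene(rng) for _ in range(100)]
```

Sampling the axes independently gives combinatorial coverage (3 × 4 × 3 × 5 × 3 = 540 distinct configurations here), which is the diversity that the abstract credits for sim-to-real robustness.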


Benchmarking Generalizable Bimanual Manipulation: RoboTwin Dual-Arm Collaboration Challenge at CVPR 2025 MEIS Workshop

Chen, Tianxing, Wang, Kaixuan, Yang, Zhaohui, Zhang, Yuhao, Chen, Zanxin, Chen, Baijun, Dong, Wanxi, Liu, Ziyuan, Chen, Dong, Yang, Tianshuo, Yu, Haibao, Yang, Xiaokang, Qin, Yusen, Xie, Zhiqiang, Mu, Yao, Luo, Ping, Nian, Tian, Deng, Weiliang, Ge, Yiheng, Liu, Yibin, Li, Zixuan, Wang, Dehui, Liang, Zhixuan, Xie, Haohui, Zeng, Rijie, Ge, Yunfei, Cong, Peiqing, He, Guannan, Han, Zhaoming, Yin, Ruocheng, Guo, Jingxiang, Lin, Lunkai, Xu, Tianling, Bi, Hongzhe, Lin, Xuewu, Lin, Tianwei, Luo, Shujie, Li, Keyu, Zhao, Ziyan, Fan, Ke, Xu, Heyang, Peng, Bo, Gao, Wenlong, Li, Dongjiang, Jin, Feng, Shen, Hui, Li, Jinming, Cui, Chaowei, Chen, Yu, Peng, Yaxin, Zeng, Lingdong, Dong, Wenlong, Li, Tengfei, Ke, Weijie, Chen, Jun, Bao, Erdemt, Lan, Tian, Liu, Tenglong, Yang, Jin, Zhuang, Huiping, Jia, Baozhi, Zhang, Shuai, Zou, Zhengfeng, Guan, Fangheng, Jia, Tianyi, Zhou, Ke, Zhang, Hongjiu, Han, Yating, Fang, Cheng, Zou, Yixian, Xu, Chongyang, Zhang, Qinglun, Cheng, Shen, Wang, Xiaohe, Tan, Ping, Fan, Haoqiang, Liu, Shuaicheng, Chen, Jiaheng, Huang, Chuxuan, Lin, Chengliang, Luo, Kaijun, Yue, Boyu, Liu, Yi, Chen, Jinyu, Tan, Zichang, Deng, Liming, Xu, Shuo, Cai, Zijian, Yin, Shilong, Wang, Hao, Liu, Hongshan, Li, Tianyang, Shi, Long, Xu, Ran, Xu, Huilin, Zhang, Zhengquan, Xu, Congsheng, Yang, Jinchang, Xu, Feng

arXiv.org Artificial Intelligence

Embodied Artificial Intelligence (Embodied AI) is an emerging frontier in robotics, driven by the need for autonomous systems that can perceive, reason, and act in complex physical environments. While single-arm systems have shown strong task performance, collaborative dual-arm systems are essential for handling more intricate tasks involving rigid, deformable, and tactile-sensitive objects. To advance this goal, we launched the RoboTwin Dual-Arm Collaboration Challenge at the 2nd MEIS Workshop, CVPR 2025. Built on the RoboTwin Simulation platform (1.0 and 2.0) and the AgileX COBOT-Magic Robot platform, the competition consisted of three stages: Simulation Round 1, Simulation Round 2, and a final Real-World Round. Participants tackled a total of 17 dual-arm manipulation tasks, covering rigid, deformable, and tactile-based scenarios. The challenge attracted 64 global teams and over 400 participants, producing top-performing solutions such as SEM and AnchorDP3 and generating valuable insights into generalizable bimanual policy learning. This report outlines the competition setup, task design, evaluation methodology, key findings, and future directions, aiming to support future research on robust and generalizable bimanual manipulation policies. The Challenge Webpage is available at https://robotwin-benchmark.github.io/cvpr-2025-challenge/.