Guo, Yanjiang
UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent
Zhang, Jianke, Guo, Yanjiang, Hu, Yucheng, Chen, Xiaoyu, Zhu, Xiang, Chen, Jianyu
Recent advancements in Vision-Language-Action (VLA) models have leveraged pre-trained Vision-Language Models (VLMs) to improve generalization capabilities. VLMs, typically pre-trained on vision-language understanding tasks, provide rich semantic knowledge and reasoning abilities. However, prior research has shown that VLMs often focus on high-level semantic content and neglect low-level features, limiting their ability to capture detailed spatial information and understand physical dynamics. These aspects, which are crucial for embodied control tasks, remain underexplored in existing pre-training paradigms. In this paper, we investigate the training paradigm for VLAs and introduce \textbf{UP-VLA}, a \textbf{U}nified VLA model trained with both multi-modal \textbf{U}nderstanding and future \textbf{P}rediction objectives, enhancing both high-level semantic comprehension and low-level spatial understanding. Experimental results show that UP-VLA achieves a 33% improvement on the Calvin ABC-D benchmark compared to the previous state-of-the-art method. Additionally, UP-VLA demonstrates improved success rates in real-world manipulation tasks, particularly those requiring precise spatial information.
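A minimal sketch of the kind of joint objective described above, assuming a PyTorch-style setup; the module names, dimensions, loss weights, and batch keys are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedVLASketch(nn.Module):
    """Toy stand-in for one backbone trained with understanding,
    prediction, and action objectives (all names/dims hypothetical)."""
    def __init__(self, dim=512, vocab=1000, action_dim=7):
        super().__init__()
        self.backbone = nn.Linear(dim, dim)               # placeholder for a pre-trained VLM
        self.understanding_head = nn.Linear(dim, vocab)   # vision-language understanding (e.g., answer tokens)
        self.prediction_head = nn.Linear(dim, dim)        # future visual latent prediction
        self.action_head = nn.Linear(dim, action_dim)     # low-level control

    def forward(self, feats):
        h = torch.relu(self.backbone(feats))
        return self.understanding_head(h), self.prediction_head(h), self.action_head(h)

def joint_loss(model, batch, w_und=1.0, w_pred=1.0, w_act=1.0):
    logits, pred_latent, action = model(batch["features"])
    loss_und = F.cross_entropy(logits, batch["answer_ids"])       # multi-modal understanding
    loss_pred = F.mse_loss(pred_latent, batch["future_latent"])   # future prediction
    loss_act = F.mse_loss(action, batch["action"])                # action imitation
    return w_und * loss_und + w_pred * loss_pred + w_act * loss_act
```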
Improving Vision-Language-Action Model with Online Reinforcement Learning
Guo, Yanjiang, Zhang, Jianke, Chen, Xiaoyu, Ji, Xiang, Wang, Yen-Jen, Hu, Yucheng, Chen, Jianyu
Recent studies have successfully integrated large vision-language models (VLMs) into low-level robotic control by supervised fine-tuning (SFT) with expert robotic datasets, resulting in what we term vision-language-action (VLA) models. Although VLA models are powerful, how to improve these large models during interaction with environments remains an open question. In this paper, we explore how to further improve these VLA models via Reinforcement Learning (RL), a commonly used fine-tuning technique for large models. However, we find that directly applying online RL to large VLA models presents significant challenges, including training instability that severely impacts the performance of large models, and computational burdens that exceed the capabilities of most local machines. To address these challenges, we propose the iRe-VLA framework, which iterates between Reinforcement Learning and Supervised Learning to effectively improve VLA models, leveraging the exploratory benefits of RL while maintaining the stability of supervised learning. Experiments in two simulated benchmarks and a real-world manipulation suite validate the effectiveness of our method.
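A hedged sketch of the iterate-between-RL-and-SL idea in plain Python; the stage implementations are passed in as callables because the concrete RL algorithm, data filtering, and fine-tuning recipe are assumptions here rather than details taken from the abstract:

```python
from typing import Any, Callable, List

def iterate_rl_sl(policy: Any,
                  rl_stage: Callable[[Any], List[Any]],      # collects trajectories via online RL
                  sl_stage: Callable[[Any, List[Any]], Any],  # supervised fine-tuning on a dataset
                  is_success: Callable[[Any], bool],
                  expert_data: List[Any],
                  n_iters: int = 10) -> Any:
    """Alternate exploration (RL) and stabilization (SL); hypothetical hooks."""
    dataset = list(expert_data)
    for _ in range(n_iters):
        # Exploration stage: online RL produces new trajectories.
        trajectories = rl_stage(policy)
        # Keep only successful rollouts so supervised learning does not imitate failures.
        dataset.extend(t for t in trajectories if is_success(t))
        # Stability stage: supervised learning on expert data plus successful RL data.
        policy = sl_stage(policy, dataset)
    return policy
```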
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Hu, Yucheng, Guo, Yanjiang, Wang, Pengchao, Chen, Xiaoyu, Wang, Yen-Jen, Zhang, Jianke, Sreenath, Koushil, Lu, Chaochao, Chen, Jianyu
Recent advancements in robotics have focused on developing generalist policies capable of performing multiple tasks. Typically, these policies utilize pre-trained vision encoders to capture crucial information from current observations. However, previous vision encoders, typically trained with two-image contrastive learning or single-image reconstruction, cannot fully capture the sequential information essential for embodied tasks. Recently, video diffusion models (VDMs) have demonstrated the capability to accurately predict future image sequences, exhibiting a good understanding of physical dynamics. Motivated by the strong visual prediction capabilities of VDMs, we hypothesize that they inherently possess visual representations that reflect the evolution of the physical world, which we term predictive visual representations. Building on this hypothesis, we propose the Video Prediction Policy (VPP), a generalist robotic policy conditioned on the predictive visual representations from VDMs. To further enhance these representations, we incorporate diverse human and robotic manipulation datasets, employing unified video-generation training objectives. VPP consistently outperforms existing methods across two simulated and two real-world benchmarks. Notably, it achieves a 28.1\% relative improvement in the Calvin ABC-D benchmark compared to the previous state-of-the-art and delivers a 28.8\% increase in success rates for complex real-world dexterous manipulation tasks.
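A minimal sketch of conditioning an action head on representations from a frozen video prediction model; the "video model" below is a placeholder MLP standing in for a real video diffusion model, and the pooling and head shapes are assumptions:

```python
import torch
import torch.nn as nn

class VideoPredictionPolicySketch(nn.Module):
    """Action head conditioned on features from a frozen predictive video model."""
    def __init__(self, feat_dim=256, action_dim=7, horizon=8):
        super().__init__()
        self.video_model = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.GELU())
        for p in self.video_model.parameters():
            p.requires_grad = False            # reuse predictive representations; do not retrain them
        self.action_head = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, action_dim * horizon))
        self.action_dim, self.horizon = action_dim, horizon

    def forward(self, obs_feats):              # obs_feats: (B, T, feat_dim)
        with torch.no_grad():
            pred_feats = self.video_model(obs_feats)   # "predictive" latents of the frame sequence
        pooled = pred_feats.mean(dim=1)                # aggregate over the sequence dimension
        return self.action_head(pooled).view(-1, self.horizon, self.action_dim)
```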
Prediction with Action: Visual Policy Learning via Joint Denoising Process
Guo, Yanjiang, Hu, Yucheng, Zhang, Jianke, Wang, Yen-Jen, Chen, Xiaoyu, Lu, Chaochao, Chen, Jianyu
Diffusion models have demonstrated remarkable capabilities in image generation tasks, including image editing and video creation, reflecting a good understanding of the physical world. In a parallel line of work, diffusion models have also shown promise in robotic control tasks by denoising actions, known as diffusion policy. Although diffusion generative models and diffusion policies exhibit distinct capabilities (image prediction and robotic action generation, respectively), they technically follow a similar denoising process. In robotic tasks, the ability to predict future images and the ability to generate actions are highly correlated, since they share the same underlying dynamics of the physical world. Building on this insight, we introduce PAD, a novel visual policy learning framework that unifies image Prediction and robot Action within a joint Denoising process. Specifically, PAD utilizes Diffusion Transformers (DiT) to seamlessly integrate images and robot states, enabling the simultaneous prediction of future images and robot actions. Additionally, PAD supports co-training on both robotic demonstrations and large-scale video datasets and can be easily extended to other robotic modalities, such as depth images. PAD outperforms previous methods, achieving a significant 26.3% relative improvement on the full Metaworld benchmark by utilizing a single text-conditioned visual policy within a data-efficient imitation learning setting. Furthermore, PAD demonstrates superior generalization to unseen tasks in real-world robot manipulation settings, with a 28.0% increase in success rate compared to the strongest baseline.
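A hedged sketch of one joint denoising step in which image tokens and action tokens share a single transformer, in the spirit of the description above; the token layout, dimensions, and output heads are illustrative assumptions rather than the DiT configuration used in the paper:

```python
import torch
import torch.nn as nn

class JointDenoiserSketch(nn.Module):
    """Denoise future-image tokens and action tokens together in one transformer."""
    def __init__(self, dim=256, n_img_tokens=64, n_act_tokens=8, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=layers)
        self.img_out = nn.Linear(dim, dim)   # predicts noise on image tokens
        self.act_out = nn.Linear(dim, dim)   # predicts noise on action tokens
        self.n_img = n_img_tokens

    def forward(self, noisy_img_tok, noisy_act_tok, cond_tok):
        # One denoising step: conditioning, image, and action tokens attend to each other.
        x = torch.cat([cond_tok, noisy_img_tok, noisy_act_tok], dim=1)
        h = self.transformer(x)
        img_h = h[:, cond_tok.size(1):cond_tok.size(1) + self.n_img]
        act_h = h[:, cond_tok.size(1) + self.n_img:]
        return self.img_out(img_h), self.act_out(act_h)
```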
HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers
Zhang, Jianke, Guo, Yanjiang, Chen, Xiaoyu, Wang, Yen-Jen, Hu, Yucheng, Shi, Chengming, Chen, Jianyu
Large Vision-Language-Action (VLA) models, leveraging powerful pre-trained Vision-Language Model (VLM) backends, have shown promise in robotic control due to their impressive generalization ability. However, this success comes at a cost. Their reliance on VLM backends with billions of parameters leads to high computational costs and inference latency, limiting the testing scenarios to mainly quasi-static tasks and hindering performance in dynamic tasks requiring rapid interactions. To address these limitations, this paper proposes HiRT, a Hierarchical Robot Transformer framework that enables a flexible trade-off between frequency and performance. HiRT keeps the VLM running at a low frequency to capture temporally invariant features while enabling real-time interaction through a high-frequency vision-based policy guided by the slowly updated features. Experimental results in both simulation and real-world settings demonstrate significant improvements over baseline methods. Empirically, in static tasks, we double the control frequency and achieve comparable success rates. Additionally, on novel real-world dynamic manipulation tasks which are challenging for previous VLA models, HiRT improves the success rate from 48% to 75%.
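A minimal sketch of the slow/fast structure described above: a slow encoder refreshes a cached latent every few control steps while a fast policy acts every step on the latest observation plus that latent. The module internals, dimensions, and update period are assumptions, with simple linear layers standing in for the VLM and the vision-based policy:

```python
import torch
import torch.nn as nn

class HierarchicalControllerSketch(nn.Module):
    """Low-frequency feature extractor guiding a high-frequency policy (toy stand-ins)."""
    def __init__(self, obs_dim=128, latent_dim=64, action_dim=7, slow_every=5):
        super().__init__()
        self.slow_encoder = nn.Linear(obs_dim, latent_dim)            # stands in for the VLM backend
        self.fast_policy = nn.Linear(obs_dim + latent_dim, action_dim)
        self.slow_every = slow_every
        self.register_buffer("cached_latent", torch.zeros(1, latent_dim))
        self.step = 0

    def act(self, obs):                                                # obs: (1, obs_dim)
        if self.step % self.slow_every == 0:
            # Low-frequency refresh of the (roughly) temporally invariant features.
            self.cached_latent = self.slow_encoder(obs).detach()
        self.step += 1
        # High-frequency action from the current observation plus the cached latent.
        return self.fast_policy(torch.cat([obs, self.cached_latent], dim=-1))
```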
DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment
Guo, Yanjiang, Wang, Yen-Jen, Zha, Lihan, Jiang, Zheyuan, Chen, Jianyu
Large language models (LLMs) encode a vast amount of semantic knowledge and possess remarkable understanding and reasoning capabilities. Previous work has explored how to ground LLMs in robotic tasks to generate feasible and executable textual plans. However, low-level execution in the physical world may deviate from the high-level textual plan due to environmental perturbations or imperfect controller design. In this paper, we propose \textbf{DoReMi}, a novel language model grounding framework that enables immediate Detection and Recovery from Misalignments between plan and execution. Specifically, we leverage LLMs to play a dual role, aiding not only in high-level planning but also in generating constraints that can indicate misalignment during execution. Vision-language models (VLMs) are then utilized to continuously detect constraint violations. Our pipeline monitors the low-level execution and enables timely recovery when plan-execution misalignment occurs. Experiments on various complex tasks with robot arms and humanoid robots demonstrate that our method can lead to higher task success rates and shorter task completion times. Videos of DoReMi are available at \url{https://sites.google.com/view/doremi-paper}.
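A hedged sketch of the detect-and-recover execution loop implied above; the four callables (LLM planning, VLM constraint checking, low-level execution, and recovery/re-planning) are hypothetical hooks, not the paper's API:

```python
def detect_and_recover(llm_plan, vlm_check, execute, recover):
    """Execute a plan while monitoring LLM-generated constraints with a VLM checker."""
    plan, constraints = llm_plan()                   # high-level subgoals + textual constraints
    i = 0
    while i < len(plan):
        done = execute(plan[i])                      # run the low-level controller for one chunk
        violated = [c for c in constraints if not vlm_check(c)]   # continuous VLM monitoring
        if violated:
            # Plan-execution misalignment detected: re-plan from the failure point.
            plan, constraints, i = recover(plan, i, violated)
            continue
        if done:
            i += 1                                   # subgoal finished, move to the next one
    return True
```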
Decentralized Motor Skill Learning for Complex Robotic Systems
Guo, Yanjiang, Jiang, Zheyuan, Wang, Yen-Jen, Gao, Jingyue, Chen, Jianyu
Reinforcement learning (RL) has achieved remarkable success in complex robotic systems (e.g., quadruped locomotion). In previous works, the RL-based controller was typically implemented as a single neural network with concatenated observation input. However, the corresponding learned policy is highly task-specific. Since all motors are controlled in a centralized way, out-of-distribution local observations can impact global motors through the single coupled neural network policy. In contrast, animals and humans can control their limbs separately. Inspired by this biological phenomenon, we propose a Decentralized motor skill (DEMOS) learning algorithm to automatically discover motor groups that can be decoupled from each other while preserving essential connections, and then learn a decentralized motor control policy. Our method improves the robustness and generalization of the policy without sacrificing performance. Experiments on quadruped and humanoid robots demonstrate that the learned policy is robust against local motor malfunctions and can be transferred to new tasks.
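A minimal sketch of a decentralized policy in which each motor group has its own network over local observations, contrasting with the single concatenated-input network described above; the grouping (one head per leg of a quadruped) and dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DecentralizedPolicySketch(nn.Module):
    """One small policy head per motor group, each seeing only local observations."""
    def __init__(self, group_obs_dims, group_action_dims, hidden=64):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(o, hidden), nn.Tanh(), nn.Linear(hidden, a))
            for o, a in zip(group_obs_dims, group_action_dims)
        )

    def forward(self, group_obs):
        # Each group acts on its own observations, so an out-of-distribution local
        # signal cannot directly corrupt the other groups' actions.
        return torch.cat([head(obs) for head, obs in zip(self.heads, group_obs)], dim=-1)

# Example: a quadruped with four legs, 12 local observation dims and 3 motors per leg.
policy = DecentralizedPolicySketch([12] * 4, [3] * 4)
actions = policy([torch.randn(1, 12) for _ in range(4)])   # -> shape (1, 12)
```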
Reinforcement learning with Demonstrations from Mismatched Task under Sparse Reward
Guo, Yanjiang, Gao, Jingyue, Wu, Zheng, Shi, Chengming, Chen, Jianyu
Reinforcement learning has been applied to various real-world tasks, including robotic manipulation with large state-action spaces and sparse reward signals [1]. In these tasks, standard reinforcement learning tends to perform a large amount of useless exploration and easily falls into local optima. To alleviate this problem, previous works often use expert demonstrations to aid online learning, adopting successful trajectories to guide the exploration process [2, 3]. However, standard learning-from-demonstration algorithms often assume that the target learning task is exactly the same as the task in which the demonstrations were collected [4, 5, 6]. Under this assumption, experts need to collect corresponding demonstrations for each new task, which can be expensive and inefficient. In this paper, we consider a new learning setting where expert data is collected under a single task, while the agent is required to solve different new tasks. For instance, as shown in Figure 1, a robot arm aims to solve peg-in-hole tasks. The demonstration is collected on a certain type of hole, while the target tasks have different hole shapes (changes in environmental dynamics) or position shifts (changes in reward function). This can be challenging, as agents cannot directly imitate demonstrations from mismatched tasks due to the changes in dynamics and reward function. However, compared to learning from scratch, those demonstrations should still provide useful information to help exploration.