Collaborating Authors

 Liu, Huaping


Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models

arXiv.org Artificial Intelligence

By injecting action components into VLMs, Vision-Language-Action models (VLAs) can be naturally formed and also show promising performance. Existing work has demonstrated the effectiveness and generalization of VLAs in multiple scenarios and tasks. Nevertheless, the transfer from VLMs to VLAs is not trivial, since existing VLAs differ in their backbones, action-prediction formulations, data distributions, and training recipes. As a result, a systematic understanding of the design choices of VLAs is still missing. In this work, we disclose the key factors that significantly influence the performance of VLAs and focus on answering three essential design questions: which backbone to select, how to formulate the VLA architecture, and when to add cross-embodiment data. The results firmly convince us of why we prefer VLAs and lead us to develop a new family of VLAs, RoboVLMs, which require very few manual designs and achieve new state-of-the-art performance in three simulation tasks and real-world experiments. Through our extensive experiments, which include over 8 VLM backbones, 4 policy architectures, and over 600 distinctly designed experiments, we provide a detailed guidebook for the future design of VLAs. In addition to the study, the highly flexible RoboVLMs framework, which supports easy integration of new VLMs and free combinations of various design choices, is made public to facilitate future research.


Bootstrapping Heterogeneous Graph Representation Learning via Large Language Models: A Generalized Approach

arXiv.org Artificial Intelligence

Graph representation learning methods are highly effective in handling complex non-Euclidean data by capturing intricate relationships and features within graph structures. However, traditional methods face challenges when dealing with heterogeneous graphs that contain various types of nodes and edges, due to the diverse sources and complex nature of the data. Existing Heterogeneous Graph Neural Networks (HGNNs) have shown promising results but require prior knowledge of node and edge types and unified node feature formats, which limits their applicability. Recent advancements in graph representation learning using Large Language Models (LLMs) offer new solutions by integrating LLMs' data processing capabilities, enabling the alignment of various graph representations. Nevertheless, these methods often overlook heterogeneous graph data and require extensive preprocessing. To address these limitations, we propose a novel method that leverages the strengths of both LLMs and GNNs, allowing for the processing of graph data with any format and type of nodes and edges without the need for type information or special preprocessing. Our method employs an LLM to automatically summarize and classify different data formats and types, aligns node features, and uses a specialized GNN for targeted learning, thus obtaining effective graph representations for downstream tasks. Theoretical analysis and experimental validation demonstrate the effectiveness of our method.


A High-frequency Pneumatic Oscillator for Soft Robotics

arXiv.org Artificial Intelligence

Soft robots, while highly adaptable to diverse environments through various actuation methods, still face significant performance limits due to the inherent properties of their materials. These limitations manifest in the challenge of guaranteeing rapid response and large-scale movements simultaneously, ultimately restricting the robots' absolute speed and overall efficiency. In this paper, we introduce a high-frequency pneumatic oscillator (HIPO) to overcome these challenges. Through a collision-induced phase resetting mechanism, HIPO leverages event-based nonlinearity to trigger self-oscillation of the pneumatic actuator, making positive use of the intrinsic characteristics of the materials. This enables the system to spontaneously generate periodic control signals and directly produce motion responses, eliminating the need for external actuation components. By efficiently and rapidly converting the internal energy of airflow into the kinetic energy of robots, HIPO achieves a frequency of up to 20 Hz. Furthermore, we demonstrate the versatility and high-performance capabilities of HIPO through bio-inspired robots: an insect-like fast-crawler (with speeds up to 50.27 cm/s), a high-frequency butterfly-like wing-flapper, and a maneuverable duck-like swimmer. By eliminating external components and seamlessly fusing signal generation, energy conversion, and motion output, HIPO unleashes rapid and efficient motion, unlocking potential for high-performance soft robotics.


MIPD: A Multi-sensory Interactive Perception Dataset for Embodied Intelligent Driving

arXiv.org Artificial Intelligence

While driving, humans usually rely on multiple senses to gather information and make decisions. Analogously, to achieve embodied intelligence in autonomous driving, it is essential to integrate multidimensional sensory information that facilitates interaction with the environment. However, current multi-modal fusion sensing schemes often neglect these additional sensory inputs, hindering the realization of fully autonomous driving. This paper considers multi-sensory information and proposes a multi-modal interactive perception dataset named MIPD, which extends the current autonomous driving algorithm framework and supports research on embodied intelligent driving. In addition to the conventional camera, lidar, and 4D radar data, our dataset incorporates multiple sensor inputs including sound, light intensity, vibration intensity, and vehicle speed to make the dataset more comprehensive. Comprising 126 consecutive sequences, many exceeding twenty seconds, MIPD features over 8,500 meticulously synchronized and annotated frames. Moreover, it encompasses many challenging scenarios, covering various road and lighting conditions. The dataset has undergone thorough experimental validation, producing valuable insights for the exploration of next-generation autonomous driving frameworks.


Multimodal Information Bottleneck for Deep Reinforcement Learning with Multiple Sensors

arXiv.org Artificial Intelligence

Reinforcement learning has achieved promising results on robotic control tasks but struggles to leverage information effectively from multiple sensory modalities that differ in many characteristics. Recent works construct auxiliary losses based on reconstruction or mutual information to extract joint representations from multiple sensory inputs to improve the sample efficiency and performance of reinforcement learning algorithms. However, the representations learned by these methods could capture information irrelevant to learning a policy and may degrade the performance. We argue that compressing information in the learned joint representations about raw multimodal observations is helpful, and propose a multimodal information bottleneck model to learn task-relevant joint representations from egocentric images and proprioception. Our model compresses and retains the predictive information in multimodal observations for learning a compressed joint representation, which fuses complementary information from visual and proprioceptive feedback and meanwhile filters out task-irrelevant information in raw multimodal observations. We propose to minimize the upper bound of our multimodal information bottleneck objective for computationally tractable optimization. Experimental evaluations on several challenging locomotion tasks with egocentric images and proprioception show that our method achieves better sample efficiency and zero-shot robustness to unseen white noise than leading baselines. We also empirically demonstrate that leveraging information from egocentric images and proprioception is more helpful for learning policies on locomotion tasks than solely using one single modality.
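For orientation, the classical information bottleneck objective that this line of work builds on can be written as below. This is the generic textbook form, not the paper's exact multimodal objective, which the abstract does not spell out. The symbols are assumptions introduced here: $X$ stands for the raw observations (for example, egocentric images together with proprioception), $Z$ for the learned joint representation, $Y$ for the quantity to be predicted, $\beta > 0$ for a trade-off weight, and $r(z)$ for an arbitrary variational prior.

```latex
\begin{align}
  \min_{p(z \mid x)} \quad & I(X; Z) \;-\; \beta \, I(Z; Y), \\
  I(X; Z) \;\le\; & \mathbb{E}_{p(x)}\!\left[ D_{\mathrm{KL}}\!\left( p(z \mid x) \,\middle\|\, r(z) \right) \right].
\end{align}
```

The second line is the standard variational upper bound on the compression term; it holds because $D_{\mathrm{KL}}(p(z) \,\|\, r(z)) \ge 0$. Minimizing a bound of this kind, rather than the mutual information itself, is what makes the optimization computationally tractable.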


Observe Then Act: Asynchronous Active Vision-Action Model for Robotic Manipulation

arXiv.org Artificial Intelligence

In real-world scenarios, many robotic manipulation tasks are hindered by occlusions and limited fields of view, posing significant challenges for passive observation-based models that rely on fixed or wrist-mounted cameras. In this paper, we investigate the problem of robotic manipulation under limited visual observation and propose a task-driven asynchronous active vision-action model. Our model serially connects a camera Next-Best-View (NBV) policy with a gripper Next-Best-Pose (NBP) policy, and trains them in a sensor-motor coordination framework using few-shot reinforcement learning. This approach allows the agent to adjust a third-person camera to actively observe the environment based on the task goal, and subsequently infer the appropriate manipulation actions. We trained and evaluated our model on 8 viewpoint-constrained tasks in RLBench. The results demonstrate that our model consistently outperforms baseline algorithms, showcasing its effectiveness in handling visual constraints in manipulation tasks.


AssistantX: An LLM-Powered Proactive Assistant in Collaborative Human-Populated Environment

arXiv.org Artificial Intelligence

The increasing demand for intelligent assistants in human-populated environments has motivated significant research in autonomous robotic systems. Traditional service robots and virtual assistants, however, struggle with real-world task execution due to their limited capacity for dynamic reasoning and interaction, particularly when human collaboration is required. Recent developments in Large Language Models have opened new avenues for improving these systems, enabling more sophisticated reasoning and natural interaction capabilities. In this paper, we introduce AssistantX, an LLM-powered proactive assistant designed to operate autonomously in a physical office environment. Unlike conventional service robots, AssistantX leverages a novel multi-agent architecture, PPDR4X, which provides advanced inference capabilities and comprehensive collaboration awareness. By effectively bridging the gap between virtual operations and physical interactions, AssistantX demonstrates robust performance in managing complex real-world scenarios. Our evaluation highlights the architecture's effectiveness, showing that AssistantX can respond to clear instructions, actively retrieve supplementary information from memory, and proactively seek collaboration from team members to ensure successful task completion. More details and videos can be found at https://assistantx-agent.github.io/AssistantX/.


Learning a Distributed Hierarchical Locomotion Controller for Embodied Cooperation

arXiv.org Artificial Intelligence

Cooperatively accomplishing embodied tasks with multiple robots has consistently been a highly challenging area of research. Recent studies mainly focus on embodied manipulation cooperation among robotic arms or upper-level formation control within a group of mobile robots [1, 2]. Nevertheless, multi-agent cooperation via whole-body and end-to-end locomotion control is rarely studied. Some previous works showcase manipulation via locomotion [3] but are only tested on two-agent systems, and it remains unknown whether this method scales to an arbitrary number of agents. In this work, we aim to realize more complex embodied multi-agent cooperation by learning a distributed hierarchical locomotion control system, decomposing the complex and coupled behaviours while maintaining the potential for unlimited expansion of the swarm. As the foundation for implementation and validation, we construct three scenarios in IsaacSim/Gym [4] as benchmarks for the study of embodied cooperation. Concurrently, training a robot for a specific function can be effectively achieved through reinforcement learning (RL), for example learning movement patterns [5], interactive behaviours [6], and logical inference in games [7]. Although RL provides a recognized powerful exploration capability and tremendous progress has been made in sampling efficiency [4], finding and mastering a sequence of sophisticated tasks through search remains a challenging problem. Hierarchical reinforcement learning (HRL) alleviates this to a certain extent, aiming to understand the logical relationships among "control, action, behaviour, dynamic outcomes, and feedback" in a segmented manner.


Leveraging Large Language Model for Heterogeneous Ad Hoc Teamwork Collaboration

arXiv.org Artificial Intelligence

Compared with the widely investigated homogeneous multi-robot collaboration, heterogeneous robots with different capabilities can provide a more efficient and flexible collaboration for more complex tasks. In this paper, we consider a more challenging heterogeneous ad hoc teamwork collaboration problem where an ad hoc robot joins an existing heterogeneous team for a shared goal. Specifically, the ad hoc robot collaborates with unknown teammates without prior coordination, and it is expected to generate an appropriate cooperation policy to improve the efficiency of the whole team. To solve this challenging problem, we leverage the remarkable potential of the large language model (LLM) to establish a decentralized heterogeneous ad hoc teamwork collaboration framework that focuses on generating reasonable policy for an ad hoc robot to collaborate with original heterogeneous teammates. A training-free hierarchical dynamic planner is developed using the LLM together with the newly proposed Interactive Reflection of Thoughts (IRoT) method for the ad hoc agent to adapt to different teams. Then, the new team collaborates and finally finishes the task.

During the past years, the multi-robot collaboration task has been widely investigated, and a bunch of multi-agent embodied tasks are proposed where multiple agents learn proper strategies to collaborate efficiently [17, 18, 22, 23, 42, 44, 45, 52] and solve complex embodied tasks [27, 35]. All these works only consider homogeneous agents with the same [...]. However, in real-world applications, the robots may be faced with more complicated situations such as seismic [...]. It is necessary to leverage robots with different capabilities to accomplish the task better [14, 19, 36, 37, 41]. Meanwhile, the ad hoc teamwork collaboration is an important problem in the heterogeneous multi-robot collaboration, which has been rarely addressed.

Imagine after a natural disaster such as an earthquake or hurricane, a team of robots is dispatched for the rescue task. Since the situation of a disaster site is complex, robots of different capabilities may be required for the rescue. These robots are likely to be brought from different places and thus arrive at the site at different times. The coming robot doesn't have any prior information on existing teammates, and it is expected to collaborate efficiently and robustly with previously unknown teammates for the same goal. [...] describes a typical heterogeneous ad hoc teamwork, and the new coming robot is called an ad hoc robot. [...] heterogeneous ad hoc teamwork collaboration is demonstrated in Figure 1, where heterogeneous robots of different capabilities can compose any team, and the original heterogeneous team collaborates to execute a task. An ad hoc robot could join this team at any time from any location, and then a heterogeneous [...]

Beijing University of Posts and Telecommunications, Beijing, China.


Small Scale Data-Free Knowledge Distillation

arXiv.org Artificial Intelligence

Data-free knowledge distillation uses the knowledge learned by a large teacher network to augment the training of a smaller student network without accessing the original training data, avoiding privacy, security, and proprietary risks in real applications. In this line of research, existing methods typically follow an inversion-and-distillation paradigm in which a generative adversarial network, trained on the fly with the guidance of the pre-trained teacher network, is used to synthesize a large-scale sample set for knowledge distillation. In this paper, we reexamine this common data-free knowledge distillation paradigm, showing that there is considerable room to improve the overall training efficiency through the lens of "small-scale inverted data for knowledge distillation". In light of three empirical observations indicating the importance of balancing class distributions in terms of synthetic sample diversity and difficulty during both the data inversion and distillation processes, we propose Small Scale Data-free Knowledge Distillation (SSD-KD). In formulation, SSD-KD introduces a modulating function to balance synthetic samples and a priority sampling function to select proper samples, facilitated by a dynamic replay buffer and a reinforcement learning strategy. As a result, SSD-KD can perform distillation training conditioned on an extremely small scale of synthetic samples (e.g., 10× less than the original training data scale), making overall training one or two orders of magnitude faster than many mainstream methods while retaining superior or competitive model performance, as demonstrated on popular image classification and semantic segmentation benchmarks. The code is available at https://github.com/OSVAI/SSD-KD.
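To make the two ingredients named above more concrete, here is a minimal sketch of a temperature-softened distillation loss and of priority-weighted sampling from a bounded replay buffer. It is not the SSD-KD implementation: the abstract does not define the modulating or priority functions, so the class `PriorityReplayBuffer`, the function `kd_loss`, the evict-lowest-priority rule, and the use of the teacher-student divergence as a priority are all assumptions made purely for illustration.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened class probabilities."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


class PriorityReplayBuffer:
    """Hypothetical bounded buffer of (sample, priority) pairs."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.rng = random.Random(seed)

    def add(self, sample, priority):
        # Assumption: when full, drop the entry with the lowest priority.
        if len(self.items) >= self.capacity:
            lowest = min(range(len(self.items)), key=lambda i: self.items[i][1])
            self.items.pop(lowest)
        self.items.append((sample, priority))

    def sample(self, k):
        # Draw k samples with replacement, proportionally to priority.
        # The small constant keeps the total weight strictly positive.
        samples = [s for s, _ in self.items]
        weights = [p + 1e-8 for _, p in self.items]
        return self.rng.choices(samples, weights=weights, k=k)


if __name__ == "__main__":
    buffer = PriorityReplayBuffer(capacity=3)
    teacher = [2.0, 0.5, -1.0]
    for name, student in [("easy", [2.0, 0.4, -1.0]), ("hard", [-1.0, 0.5, 2.0])]:
        # Assumption: samples the student gets most wrong receive higher priority.
        buffer.add(name, kd_loss(teacher, student))
    print(buffer.sample(4))
```

Under this assumed priority, samples on which the student disagrees most with the teacher are replayed more often, which is one simple way to bias a small synthetic set toward difficult examples.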