Xue, Zhengrong


DemoGen: Synthetic Demonstration Generation for Data-Efficient Visuomotor Policy Learning

arXiv.org Artificial Intelligence

Visuomotor policies have shown great promise in robotic manipulation but often require substantial amounts of human-collected data for effective performance. A key reason underlying the data demands is their limited spatial generalization capability, which necessitates extensive data collection across different object configurations. In this work, we present DemoGen, a low-cost, fully synthetic approach for automatic demonstration generation. Using only one human-collected demonstration per task, DemoGen generates spatially augmented demonstrations by adapting the demonstrated action trajectory to novel object configurations. Visual observations are synthesized by leveraging 3D point clouds as the modality and rearranging the subjects in the scene via 3D editing. Empirically, DemoGen significantly enhances policy performance across a diverse range of real-world manipulation tasks, showing its applicability even in challenging scenarios involving deformable objects, dexterous hand end-effectors, and bimanual platforms. Furthermore, DemoGen can be extended to enable additional out-of-distribution capabilities, including disturbance resistance and obstacle avoidance.
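
As a rough illustration of the spatial-augmentation idea, the sketch below (plain NumPy; all function and variable names are my own, not DemoGen's API) translates an object's points in a recorded point cloud and shifts the demonstrated end-effector trajectory by the same offset, producing one synthetic demonstration for a novel object configuration.

# Hedged sketch of spatial demonstration augmentation; names are illustrative assumptions.
import numpy as np

def generate_augmented_demo(points, obj_mask, actions, delta_xy):
    """Shift the manipulated object and the demonstrated waypoints by a planar
    offset delta_xy = (dx, dy), leaving the rest of the scene untouched.

    points:   (N, 3) scene point cloud of the source demonstration
    obj_mask: (N,) boolean mask selecting the object's points
    actions:  (T, 3) end-effector positions of the demonstrated trajectory
    """
    offset = np.array([delta_xy[0], delta_xy[1], 0.0])

    # 3D-edit the observation: translate only the object's points.
    new_points = points.copy()
    new_points[obj_mask] += offset

    # Adapt the action trajectory (here, naively, the whole trajectory).
    new_actions = actions + offset
    return new_points, new_actions

# Usage: synthesize one augmented demo from one source demo.
rng = np.random.default_rng(0)
pts = rng.uniform(-0.3, 0.3, size=(2048, 3))
mask = np.linalg.norm(pts[:, :2], axis=1) < 0.05   # pretend these points are the object
traj = np.linspace([0.0, 0.0, 0.2], [0.0, 0.0, 0.0], 50)
aug_pts, aug_traj = generate_augmented_demo(pts, mask, traj, delta_xy=(0.10, -0.05))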


MENTOR: Mixture-of-Experts Network with Task-Oriented Perturbation for Visual Reinforcement Learning

arXiv.org Artificial Intelligence

Visual deep reinforcement learning (RL) enables robots to acquire skills from visual input for unstructured tasks. However, current algorithms suffer from low sample efficiency, limiting their practical applicability. In this work, we present MENTOR, a method that improves both the architecture and optimization of RL agents. Specifically, MENTOR replaces the standard multi-layer perceptron (MLP) with a mixture-of-experts (MoE) backbone, enhancing the agent's ability to handle complex tasks by leveraging modular expert learning to avoid gradient conflicts. Furthermore, MENTOR introduces a task-oriented perturbation mechanism, which heuristically samples perturbation candidates containing task-relevant information, leading to more targeted and effective optimization. MENTOR outperforms state-of-the-art methods across three simulation domains: DeepMind Control Suite, Meta-World, and Adroit. Additionally, MENTOR achieves an average success rate of 83% on three challenging real-world robotic manipulation tasks, Peg Insertion, Cable Routing, and Tabletop Golf, significantly surpassing the 32% success rate of the current strongest model-free visual RL algorithm. These results underscore the importance of sample efficiency in advancing visual RL for real-world robotics. Experimental videos are available at mentor.

Figure 1: MENTOR is validated in real-world tasks. We design three challenging robotic learning tasks for the agent to acquire skills through real-world visual reinforcement learning. MENTOR achieves the most efficient and robust policies compared to the baselines.
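
To make the architectural change concrete, here is a minimal sketch of a soft-gated mixture-of-experts policy head of the kind that could replace a plain MLP; the layer sizes, gating scheme, and class name are assumptions for illustration rather than MENTOR's exact design.

# Hedged sketch of an MoE policy backbone; not the paper's exact architecture.
import torch
import torch.nn as nn

class MoEPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, num_experts=4, hidden=256):
        super().__init__()
        # Each expert is a small MLP; a learned gate mixes their outputs per input.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, act_dim))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(obs_dim, num_experts)

    def forward(self, obs):
        weights = torch.softmax(self.gate(obs), dim=-1)            # (B, E)
        outs = torch.stack([e(obs) for e in self.experts], dim=1)  # (B, E, A)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)           # (B, A)

policy = MoEPolicy(obs_dim=64, act_dim=6)
action = policy(torch.randn(8, 64))   # batch of 8 observations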


AToM-Bot: Embodied Fulfillment of Unspoken Human Needs with Affective Theory of Mind

arXiv.org Artificial Intelligence

We propose AToM-Bot, a novel task generation and execution framework for proactive robot-human interaction, which leverages the human mental and physical state inference capabilities of a Vision Language Model (VLM) prompted with the Affective Theory of Mind (AToM). Without requiring explicit commands from humans, AToM-Bot proactively generates and follows feasible tasks to improve general human well-being. When around humans, AToM-Bot first detects current human needs based on inferred human states and observations of the surrounding environment. It then generates tasks to fulfill these needs, taking into account its embodied constraints. We designed 16 daily life scenarios spanning 4 common scenes and presented the same visual stimuli to 59 human subjects and our robot. We used the similarity between open-ended human answers and the robot's output, together with human satisfaction scores, to measure robot performance. AToM-Bot received high human evaluations in need detection (6.42/7, 91.7%), embodied solution (6.15/7, 87.8%), and task execution (6.17/7, 88.1%). We show that AToM-Bot excels in generating and executing feasible plans to fulfill unspoken human needs. Videos and code are available at https://affective-tom-bot.github.io.
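
The pipeline above (infer unspoken needs, then generate embodiment-feasible tasks) could be prompted roughly as follows; vlm is a placeholder for whatever vision-language backend is used, and the prompts and function names are illustrative, not AToM-Bot's actual implementation.

# Hedged sketch of a two-stage prompting flow; vlm(prompt, image) -> str is a stand-in.
def infer_needs(vlm, image):
    prompt = ("You observe the scene in the image. Using affective theory of mind, "
              "infer the person's current mental and physical state and list any "
              "unspoken needs.")
    return vlm(prompt, image)

def generate_tasks(vlm, image, needs, robot_skills):
    prompt = (f"The person appears to need: {needs}. "
              f"The robot can only: {', '.join(robot_skills)}. "
              "Propose concrete tasks the robot can execute to help.")
    return vlm(prompt, image)

# Usage with any callable vlm(prompt, image) -> str:
# needs = infer_needs(vlm, image)
# tasks = generate_tasks(vlm, image, needs, ["pick", "place", "open drawer"])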


RiEMann: Near Real-Time SE(3)-Equivariant Robot Manipulation without Point Cloud Segmentation

arXiv.org Artificial Intelligence

We present RiEMann, an end-to-end near Real-time SE(3)-Equivariant Robot Manipulation imitation learning framework that operates on scene point cloud input. Compared to previous methods that rely on descriptor field matching, RiEMann directly predicts the target poses of objects for manipulation without any object segmentation. RiEMann learns a manipulation task from scratch with 5 to 10 demonstrations, generalizes to unseen SE(3) transformations and instances of target objects, resists visual interference from distracting objects, and follows the near real-time pose changes of the target object. The scalable action space of RiEMann facilitates the addition of custom equivariant actions, such as the direction in which to turn a faucet, which makes articulated object manipulation possible for RiEMann. In simulation and real-world 6-DOF robot manipulation experiments, we test RiEMann on 5 categories of manipulation tasks with a total of 25 variants and show that RiEMann outperforms baselines in both task success rate and SE(3) geodesic distance error on predicted poses (reduced by 68.6%), and achieves a network inference speed of 5.4 frames per second (FPS). Code and video results are available at https://riemann-web.github.io/.
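
For reference, the SE(3) pose error behind a "geodesic distance" metric is commonly computed as below, split into a rotation geodesic and a translation distance; this is a standard formulation, not necessarily the paper's exact weighting.

# Hedged sketch of a common SE(3) pose-error computation.
import numpy as np

def rotation_geodesic(R_pred, R_gt):
    # Angle of the relative rotation R_pred^T R_gt, in radians.
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def translation_error(t_pred, t_gt):
    return float(np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt)))

# Usage: identity vs. a 90-degree rotation about z gives pi/2.
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print(rotation_geodesic(np.eye(3), Rz))            # ~1.5708 rad
print(translation_error([0, 0, 0], [0.1, 0, 0]))   # 0.1 m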


ArrayBot: Reinforcement Learning for Generalizable Distributed Manipulation through Touch

arXiv.org Artificial Intelligence

The notion of robotic manipulation [1, 2] easily invokes the image of a biomimetic robot arm or hand trying to grasp tabletop objects and then rearrange them into desired configurations inferred by exteroceptive sensors such as RGB-D cameras. To facilitate this manipulation pipeline, the robot learning community has made tremendous efforts in either determining steadier grasping poses in demanding scenarios [3, 4, 5, 6, 7] or understanding the exteroceptive inputs in a more robust and generalizable way [8, 9, 10, 11, 12, 13]. Acknowledging this progress, this paper attempts to bypass the challenges in the prevailing pipeline by advocating ArrayBot, a reinforcement-learning-driven system for distributed manipulation [14], where objects are manipulated through a large number of actuators with only proprioceptive tactile sensing [15, 16, 17, 18]. Conceptually, the hardware of ArrayBot is a 16 × 16 array of vertically sliding pillars, each of which can be independently actuated, leading to a 16 × 16 (256-dimensional) action space. Functionally, the actuators beneath a tabletop object can support its weight and at the same time cooperate to lift, tilt, and even translate it through proper motion policies.
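
A toy sketch of what a 16 × 16 action space looks like in code: every pillar receives an independent height command each step. The environment below is a stand-in for illustration only, not ArrayBot's controller or simulator.

# Hedged sketch of a 16 x 16 distributed-manipulation action space.
import numpy as np

class ToyPillarArray:
    def __init__(self, size=16, max_height=0.05):
        self.size = size
        self.max_height = max_height
        self.heights = np.zeros((size, size))

    def step(self, action):
        # action: (16, 16) array of target pillar heights in meters.
        action = np.clip(np.asarray(action), 0.0, self.max_height)
        self.heights = action
        # A real system would now simulate/observe how the supported object moves.
        return self.heights.copy()

env = ToyPillarArray()
tilt = np.tile(np.linspace(0.0, 0.05, 16), (16, 1))  # ramp that tilts a supported object
obs = env.step(tilt)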


USEEK: Unsupervised SE(3)-Equivariant 3D Keypoints for Generalizable Manipulation

arXiv.org Artificial Intelligence

Can a robot manipulate intra-category unseen objects in arbitrary poses with the help of a mere demonstration of a grasping pose on a single object instance? In this paper, we try to address this intriguing challenge by using USEEK, an unsupervised SE(3)-equivariant keypoint method that enjoys alignment across instances in a category, to perform generalizable manipulation. USEEK follows a teacher-student structure to decouple unsupervised keypoint discovery from SE(3)-equivariant keypoint detection. With USEEK in hand, the robot can infer the category-level task-relevant object frames in an efficient and explainable manner, enabling manipulation of any intra-category objects from and to any poses. Through extensive experiments, we demonstrate that the keypoints produced by USEEK possess rich semantics, thus successfully transferring the functional knowledge from the demonstration object to novel ones. Compared with other object representations for manipulation, USEEK is more adaptive in the face of large intra-category shape variance, more robust with limited demonstrations, and more efficient at inference time.
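
One way aligned keypoints enable grasp transfer is to estimate the rigid transform between corresponding keypoint sets on the demonstration object and a novel instance, then re-express the demonstrated grasp in the new frame; the sketch below uses a standard Kabsch alignment and illustrative function names, not USEEK's actual interface.

# Hedged sketch of keypoint-based grasp transfer via rigid alignment.
import numpy as np

def rigid_transform_from_keypoints(src, dst):
    """Least-squares R, t such that dst ~ R @ src + t; src, dst are (K, 3)."""
    src_c = src - src.mean(axis=0)
    dst_c = dst - dst.mean(axis=0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(axis=0) - R @ src.mean(axis=0)
    return R, t

def transfer_grasp(grasp_R, grasp_t, kp_demo, kp_novel):
    # Re-express the demonstrated grasp pose for the novel object instance.
    R, t = rigid_transform_from_keypoints(kp_demo, kp_novel)
    return R @ grasp_R, R @ grasp_t + t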


Pre-Trained Image Encoder for Generalizable Visual Reinforcement Learning

arXiv.org Artificial Intelligence

Learning generalizable policies that can adapt to unseen environments remains challenging in visual Reinforcement Learning (RL). Existing approaches try to acquire a robust representation by diversifying the appearances of in-domain observations for better generalization. Limited by the specific observations of the environment, these methods ignore the possibility of exploring diverse real-world image datasets. In this paper, we investigate how a visual RL agent would benefit from off-the-shelf visual representations. Surprisingly, we find that the early layers of an ImageNet pre-trained ResNet model can provide rather generalizable representations for visual RL. Hence, we propose the Pre-trained Image Encoder for Generalizable visual reinforcement learning (PIE-G), a simple yet effective framework that can generalize to unseen visual scenarios in a zero-shot manner. Extensive experiments are conducted on the DMControl Generalization Benchmark, DMControl Manipulation Tasks, Drawer World, and CARLA to verify the effectiveness of PIE-G. Empirical evidence suggests PIE-G improves sample efficiency and significantly outperforms previous state-of-the-art methods in terms of generalization performance. In particular, PIE-G boasts a 55% generalization performance gain on average in the challenging video background setting. Project Page: https://sites.google.com/view/pie-g/home.
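
The core recipe (freeze the early layers of an ImageNet pre-trained ResNet and feed their features to the policy) can be sketched as follows; the cut point after layer2 and the input resolution are assumptions, and PIE-G's exact configuration may differ.

# Hedged sketch: frozen early ResNet layers as a visual RL encoder (torchvision >= 0.13).
import torch
import torch.nn as nn
from torchvision import models

resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
early = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
                      resnet.layer1, resnet.layer2)
for p in early.parameters():
    p.requires_grad = False   # keep the pre-trained features frozen
early.eval()

with torch.no_grad():
    feats = early(torch.randn(1, 3, 84, 84))   # (1, 128, 11, 11) for ResNet-18 at 84x84
    obs_embedding = feats.flatten(1)           # fed to the trainable policy head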


BiasedWalk: Learning Global-aware Node Embeddings via Biased Sampling

arXiv.org Artificial Intelligence

Popular node embedding methods such as DeepWalk follow the paradigm of performing random walks on the graph and then requiring each node's embedding to be close to those of the nodes that appear along with it in the walks. Though proven successful in various tasks, this paradigm reduces a graph with rich topology to a set of sequential sentences, thus omitting global information. To produce global-aware node embeddings, we propose BiasedWalk, a biased random walk strategy that favors nodes with similar semantics. Empirical evidence suggests BiasedWalk can generally enhance the global awareness of the generated embeddings.
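
A minimal sketch of a semantics-biased random walk consistent with the description above: transition probabilities are re-weighted by feature similarity, so walks tend to stay among semantically related nodes. The softmax temperature, similarity choice, and toy graph are illustrative assumptions, not BiasedWalk's exact procedure.

# Hedged sketch of a semantics-biased random walk sampler.
import numpy as np
import networkx as nx

def biased_walk(G, features, start, length, temp=1.0, rng=None):
    rng = rng or np.random.default_rng()
    walk = [start]
    for _ in range(length - 1):
        cur = walk[-1]
        nbrs = list(G.neighbors(cur))
        if not nbrs:
            break
        # Favor neighbors whose features are similar to the current node's.
        sims = np.array([features[cur] @ features[n] for n in nbrs])
        probs = np.exp(sims / temp)
        probs /= probs.sum()
        walk.append(nbrs[rng.choice(len(nbrs), p=probs)])
    return walk

# Usage on a toy graph with random node features:
G = nx.karate_club_graph()
feats = {n: np.random.default_rng(n).normal(size=8) for n in G.nodes}
print(biased_walk(G, feats, start=0, length=10))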