
Collaborating Authors

 Wen, Chuan


FP3: A 3D Foundation Policy for Robotic Manipulation

arXiv.org Artificial Intelligence

FP3 supports data-efficient fine-tuning for downstream tasks, while demonstrating superior generalizability to unseen environments and novel objects.

Abstract -- Following its success in natural language processing and computer vision, foundation models that are pre-trained on large-scale multi-task datasets have also shown great potential in robotics. However, most existing robot foundation models rely solely on 2D image observations, ignoring 3D geometric information, which is essential for robots to perceive and reason about the 3D world. In this paper, we introduce FP3, a 3D foundation policy for robotic manipulation. FP3 builds on a scalable diffusion transformer architecture and is pre-trained on 60k trajectories with point cloud observations. With the model design and diverse pre-training data, FP3 can be efficiently fine-tuned for downstream tasks while exhibiting strong generalization capabilities. Experiments on real robots demonstrate that with only 80 demonstrations, FP3 is able to learn a new task with over 90% success rates in novel environments with unseen objects, significantly surpassing existing robot foundation models. Visualizations and code are available at: FP3.

INTRODUCTION: Learning-based policies have shown great effectiveness in robotic manipulation [6, 80, 12, 75, 36, 3]. However, these learned policies often show limited or even zero generalization capability to unseen scenarios, new objects, and distractors [66]. Additionally, most current methods are trained on single or few tasks [12, 75], requiring a relatively large amount of expert demonstrations (usually about 200 episodes) to learn a new task.
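
To make the described interface concrete, here is a rough sketch of a point-cloud-conditioned diffusion policy denoising step (an action chunk plus a pooled point-cloud token passed through a transformer). All module sizes, the noise schedule, and the architecture are illustrative assumptions, not FP3's actual design.

# Minimal sketch of a point-cloud-conditioned diffusion policy step.
# All sizes and the noise schedule are illustrative, not FP3's design.
import torch
import torch.nn as nn

class PointCloudDiffusionPolicy(nn.Module):
    def __init__(self, action_dim=7, horizon=16, d_model=256):
        super().__init__()
        self.point_encoder = nn.Sequential(          # per-point MLP + max pooling
            nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, d_model))
        self.action_proj = nn.Linear(action_dim, d_model)
        self.time_embed = nn.Embedding(1000, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, action_dim)

    def forward(self, points, noisy_actions, t):
        # points: (B, N, 3) point cloud; noisy_actions: (B, horizon, action_dim)
        obs_tok = self.point_encoder(points).max(dim=1, keepdim=True).values
        act_tok = self.action_proj(noisy_actions) + self.time_embed(t)[:, None, :]
        out = self.transformer(torch.cat([obs_tok, act_tok], dim=1))[:, 1:, :]
        return self.head(out)                        # predicted noise per action step

policy = PointCloudDiffusionPolicy()
eps = policy(torch.randn(2, 1024, 3), torch.randn(2, 16, 7),
             torch.randint(0, 1000, (2,)))
print(eps.shape)  # torch.Size([2, 16, 7])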


Predictive Inference With Fast Feature Conformal Prediction

arXiv.org Machine Learning

Conformal prediction is widely adopted in uncertainty quantification, due to its post-hoc, distribution-free, and model-agnostic properties. In the realm of modern deep learning, researchers have proposed Feature Conformal Prediction (FCP), which deploys conformal prediction in a feature space, yielding reduced band lengths. However, the practical utility of FCP is limited due to the time-consuming non-linear operations required to transform confidence bands from feature space to output space. In this paper, we introduce Fast Feature Conformal Prediction (FFCP), which features a novel non-conformity score and is convenient for practical applications. FFCP serves as a fast version of FCP, in that it equivalently employs a Taylor expansion to approximate the aforementioned non-linear operations in FCP. Empirical validations showcase that FFCP performs comparably with FCP (both outperforming the vanilla version) while achieving a significant reduction in computational time by approximately 50x. The code is available at https://github.com/ElvisWang1111/FastFeatureCP
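
One way to read the abstract: the head of the network is linearized with a first-order Taylor expansion, so the feature-space non-conformity score can be computed from a residual divided by a gradient norm, avoiding the expensive band transformation in FCP. The sketch below illustrates that idea only; it is not the authors' implementation (see the linked repository for that), and all names are placeholders.

# Hedged sketch of a Taylor-linearized, feature-space non-conformity score
# used inside split conformal prediction.
import torch

def _pred_and_sensitivity(g, h, x):
    v = g(x).detach().requires_grad_(True)           # features
    pred = h(v).squeeze(-1)                          # scalar prediction per sample
    grad = torch.autograd.grad(pred.sum(), v)[0]     # dh/dv for each sample
    return pred.detach(), grad.flatten(1).norm(dim=1).clamp_min(1e-8)

def ffcp_style_intervals(g, h, x_cal, y_cal, x_test, alpha=0.1):
    # Calibration score: |residual| / ||dh/dv||, i.e. the feature-space distance
    # that would explain the residual under a first-order Taylor expansion.
    pred_c, sens_c = _pred_and_sensitivity(g, h, x_cal)
    s = (y_cal - pred_c).abs() / sens_c
    n = s.numel()
    q = torch.quantile(s, min(1.0, (1 - alpha) * (n + 1) / n))
    # Map the feature-space quantile back to output space via the same sensitivity.
    pred_t, sens_t = _pred_and_sensitivity(g, h, x_test)
    return pred_t - q * sens_t, pred_t + q * sens_t

g, h = torch.nn.Linear(5, 8), torch.nn.Linear(8, 1)   # toy feature extractor / head
lo, hi = ffcp_style_intervals(g, h, torch.randn(200, 5), torch.randn(200),
                              torch.randn(10, 5))
print(lo.shape, hi.shape)  # torch.Size([10]) torch.Size([10])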


Data Scaling Laws in Imitation Learning for Robotic Manipulation

arXiv.org Artificial Intelligence

Data scaling has revolutionized fields like natural language processing and computer vision, providing models with remarkable generalization capabilities. In this paper, we investigate whether similar data scaling laws exist in robotics, particularly in robotic manipulation, and whether appropriate data scaling can yield single-task robot policies that can be deployed zero-shot for any object within the same category in any environment. To this end, we conduct a comprehensive empirical study on data scaling in imitation learning. By collecting data across numerous environments and objects, we study how a policy's generalization performance changes with the number of training environments, objects, and demonstrations. Throughout our research, we collect over 40,000 demonstrations and execute more than 15,000 real-world robot rollouts under a rigorous evaluation protocol. Our findings reveal several intriguing results: the generalization performance of the policy follows a roughly power-law relationship with the number of environments and objects. The diversity of environments and objects is far more important than the absolute number of demonstrations; once the number of demonstrations per environment or object reaches a certain threshold, additional demonstrations have minimal effect. Based on these insights, we propose an efficient data collection strategy. With four data collectors working for one afternoon, we collect sufficient data to enable the policies for two tasks to achieve approximately 90% success rates in novel environments with unseen objects.
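
The claimed roughly power-law relationship between generalization performance and the number of environments or objects can be checked with a simple log-log fit; the data points below are invented purely for illustration.

# Toy illustration of fitting a power law  perf ~ a * n^b  to success rate
# versus the number of training environments; the numbers are made up.
import numpy as np

n_envs  = np.array([1, 2, 4, 8, 16, 32])
success = np.array([0.18, 0.27, 0.40, 0.55, 0.72, 0.88])   # hypothetical

b, log_a = np.polyfit(np.log(n_envs), np.log(success), deg=1)
print(f"power-law fit: success ~ {np.exp(log_a):.2f} * n^{b:.2f}")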


Can Transformers Capture Spatial Relations between Objects?

arXiv.org Artificial Intelligence

Spatial relationships between objects represent key scene information for humans to understand and interact with the world. To study the capability of current computer vision systems to recognize physically grounded spatial relations, we start by proposing precise relation definitions that permit consistently annotating a benchmark dataset. Despite the apparent simplicity of this task relative to others in the recognition literature, we observe that existing approaches perform poorly on this benchmark. We propose new approaches exploiting the long-range attention capabilities of transformers for this task, and evaluate key design principles. We identify a simple "RelatiViT" architecture and demonstrate that it outperforms all current approaches. To our knowledge, this is the first method to convincingly outperform naive baselines on spatial relation prediction in in-the-wild settings. The code and datasets are available in \url{https://sites.google.com/view/spatial-relation}.
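
As a self-contained illustration of the task setup (image plus two object boxes in, relation class out) with a plain transformer encoder; this is not the RelatiViT architecture, and all dimensions and the number of relation classes are assumptions.

# Minimal sketch of a transformer-based spatial relation classifier: patch
# tokens from the image plus two box-embedding query tokens for subject and
# object. Sizes are illustrative, not RelatiViT itself.
import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    def __init__(self, num_relations=9, d_model=256, patch=16):
        super().__init__()
        self.patchify = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        self.box_embed = nn.Linear(4, d_model)       # (x1, y1, x2, y2), normalized
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(2 * d_model, num_relations)

    def forward(self, image, subj_box, obj_box):
        tokens = self.patchify(image).flatten(2).transpose(1, 2)      # (B, P, d)
        queries = torch.stack([self.box_embed(subj_box),
                               self.box_embed(obj_box)], dim=1)       # (B, 2, d)
        out = self.encoder(torch.cat([queries, tokens], dim=1))
        return self.head(out[:, :2].flatten(1))      # relation logits

model = RelationClassifier()
logits = model(torch.randn(2, 3, 224, 224), torch.rand(2, 4), torch.rand(2, 4))
print(logits.shape)  # torch.Size([2, 9])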


Imitation Learning from Observation with Automatic Discount Scheduling

arXiv.org Artificial Intelligence

Humans often acquire new skills through observation and imitation. For robotic agents, learning from the plethora of unlabeled video demonstration data available on the Internet necessitates imitating the expert without access to its action, presenting a challenge known as Imitation Learning from Observations (ILfO). A common approach to tackle ILfO problems is to convert them into inverse reinforcement learning problems, utilizing a proxy reward computed from the agent's and the expert's observations. Nonetheless, we identify that tasks characterized by a progress dependency property pose significant challenges for such approaches; in these tasks, the agent needs to initially learn the expert's preceding behaviors before mastering the subsequent ones. Our investigation reveals that the main cause is that the reward signals assigned to later steps hinder the learning of initial behaviors. To address this challenge, we present a novel ILfO framework that enables the agent to master earlier behaviors before advancing to later ones. We introduce an Automatic Discount Scheduling (ADS) mechanism that adaptively alters the discount factor in reinforcement learning during the training phase, prioritizing earlier rewards initially and gradually engaging later rewards only when the earlier behaviors have been mastered. Our experiments, conducted on nine Meta-World tasks, demonstrate that our method significantly outperforms state-of-the-art methods across all tasks, including those that are unsolvable by them.
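
A minimal sketch of the scheduling idea described here: keep the discount factor small so the agent focuses on early rewards, and raise it as a progress estimate improves. The linear schedule and the progress measure are assumptions for illustration, not the paper's exact mechanism.

# Hedged sketch of an automatic discount schedule for ILfO training.
def scheduled_gamma(progress, gamma_min=0.9, gamma_max=0.99):
    """progress in [0, 1]: how much of the earlier expert behavior is mastered."""
    progress = min(max(progress, 0.0), 1.0)
    return gamma_min + (gamma_max - gamma_min) * progress

def discounted_return(rewards, progress):
    gamma, weight, ret = scheduled_gamma(progress), 1.0, 0.0
    for r in rewards:           # early rewards dominate when progress is low
        ret += weight * r
        weight *= gamma
    return ret

print(scheduled_gamma(0.0), scheduled_gamma(1.0))   # 0.9 0.99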


General Flow as Foundation Affordance for Scalable Robot Learning

arXiv.org Artificial Intelligence

Figure 1: We propose General Flow as Foundation Affordance. Its properties and applications are analyzed to reveal its great power. We design a scale-aware algorithm for general flow prediction and achieve stable zero-shot cross-embodiment skill transfer in the real world.

Abstract -- We address the challenge of acquiring real-world manipulation skills with a scalable framework. Inspired by the success of large-scale auto-regressive prediction in Large Language Models (LLMs), we hold the belief that identifying an appropriate prediction target capable of leveraging large-scale data is crucial. Therefore, we propose to utilize flow, which represents the future trajectories of 3D points on objects of interest, as an ideal prediction target in robot learning. To exploit scalable data resources, we turn our attention to cross-embodiment datasets. We first develop pipelines to extract 3D flow labels directly from RGBD human video datasets, and develop, for the first time, a language-conditioned prediction model for general flow. We find prediction of dense flow in real-world scene point clouds remains a formidable challenge, primarily due to the variability of trajectory scales and the need to enhance robustness in zero-shot settings. To address these issues, we employ scale-aware strategies in both the data and model aspects, complemented by augmentation techniques that focus on embodiment occlusion (human hand and robot arm) and query point sampling (3D points on objects of interest), thereby boosting zero-shot stability. The resulting flow offers (1) scalability: leveraging cross-embodiment data resources; (2) universality: multiple object categories, including rigid, articulated, and soft bodies; (3) stable skill transfer: providing actionable guidance, thus facilitating stable zero-shot skill transfer in real-world manipulation. These lead to a new pathway towards scalable general robot learning. We deploy our method with a policy based on closed-loop flow prediction. Remarkably, without any additional training, our method achieves an impressive 81% success rate in human-to-robot skill transfer, covering 18 tasks in 6 scenes.

We aim to reveal a potential pathway for replicating the success of Large Language Models (LLMs) in the domain of robot learning. Specifically, we are interested in developing a new framework that enables scalable learning for robot manipulation. In the future, this framework has the potential to progressively enhance the capabilities of robots, i.e., the scaling law that has been observed in LLMs [82]. Inspired by the LLMs training paradigm [14], we believe that two key elements contribute to their strong generalization abilities: (1) a vast training dataset
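
To make the "flow as prediction target" idea concrete, here is a rough sketch of turning one predicted flow (future trajectories of query points on an object) into a relative end-effector translation for a closed-loop controller step. The averaging heuristic and the data layout are illustrative choices, not the paper's policy.

# Hedged sketch: convert predicted general flow, shape (T, K, 3), into a
# single 3D end-effector displacement for one control step.
import numpy as np

def flow_to_ee_delta(query_points, predicted_flow, lookahead=1):
    """query_points: (K, 3) current 3D points; predicted_flow: (T, K, 3)."""
    target = predicted_flow[min(lookahead, len(predicted_flow)) - 1]
    return (target - query_points).mean(axis=0)      # mean 3D displacement

# toy usage: points drifting 1 cm along +x per future step
pts = np.zeros((8, 3))
flow = np.stack([pts + [0.01 * (t + 1), 0.0, 0.0] for t in range(4)])
print(flow_to_ee_delta(pts, flow))                   # ~ [0.01, 0., 0.]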


Any-point Trajectory Modeling for Policy Learning

arXiv.org Artificial Intelligence

Learning from demonstration is a powerful method for teaching robots new skills, and more demonstration data often improves policy learning. However, the high cost of collecting demonstration data is a significant bottleneck. Videos, as a rich data source, contain knowledge of behaviors, physics, and semantics, but extracting control-specific information from them is challenging due to the lack of action labels. In this work, we introduce a novel framework, Any-point Trajectory Modeling (ATM), that utilizes video demonstrations by pre-training a trajectory model to predict future trajectories of arbitrary points within a video frame. Once trained, these trajectories provide detailed control guidance, enabling the learning of robust visuomotor policies with minimal action-labeled data. Our method's effectiveness is demonstrated across 130 simulation tasks, focusing on language-conditioned manipulation tasks. Visualizations and code are available at: \url{https://xingyu-lin.github.io/atm}.
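
As a sketch of what any-point trajectory modeling outputs (future 2D tracks for arbitrary query points in a frame), which can then condition a visuomotor policy; the model below is a generic transformer placeholder with assumed sizes, not the ATM architecture.

# Minimal sketch of a track predictor: given a frame and K query pixels,
# predict their next T 2D positions. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TrackPredictor(nn.Module):
    def __init__(self, horizon=8, d_model=256):
        super().__init__()
        self.horizon = horizon
        self.frame_enc = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=8), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, d_model))
        self.point_enc = nn.Linear(2, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 2 * horizon)

    def forward(self, frame, query_points):
        # frame: (B, 3, H, W); query_points: (B, K, 2) normalized pixel coords
        ctx = self.frame_enc(frame)[:, None, :]           # (B, 1, d) frame context
        tokens = self.point_enc(query_points) + ctx       # (B, K, d)
        tracks = self.head(self.encoder(tokens))          # (B, K, 2 * horizon)
        return tracks.view(*tracks.shape[:2], self.horizon, 2)

model = TrackPredictor()
out = model(torch.randn(2, 3, 128, 128), torch.rand(2, 16, 2))
print(out.shape)  # torch.Size([2, 16, 8, 2])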


Predictive Inference with Feature Conformal Prediction

arXiv.org Artificial Intelligence

Conformal prediction is a distribution-free technique for establishing valid prediction intervals. Although conventionally people conduct conformal prediction in the output space, this is not the only possibility. In this paper, we propose feature conformal prediction, which extends the scope of conformal prediction to semantic feature spaces by leveraging the inductive bias of deep representation learning. From a theoretical perspective, we demonstrate that feature conformal prediction provably outperforms regular conformal prediction under mild assumptions. Our approach could be combined with not only vanilla conformal prediction, but also other adaptive conformal prediction methods. Apart from experiments on existing predictive inference benchmarks, we also demonstrate the state-of-the-art performance of the proposed methods on large-scale tasks such as ImageNet classification and Cityscapes image segmentation. The code is available at \url{https://github.com/AlvinWen428/FeatureCP}.
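
For contrast with the feature-space variant, a minimal sketch of vanilla split conformal regression, the baseline the paper extends: calibrate a residual quantile on held-out data, then form symmetric intervals around predictions. Feature conformal prediction applies the same calibration logic but measures non-conformity in the network's feature space. The helper below is a generic implementation, not the authors' code.

# Vanilla split conformal regression for reference.
import numpy as np

def split_conformal_intervals(predict, x_cal, y_cal, x_test, alpha=0.1):
    residuals = np.abs(y_cal - predict(x_cal))
    n = len(residuals)
    q = np.quantile(residuals, min(1.0, (1 - alpha) * (n + 1) / n))
    preds = predict(x_test)
    return preds - q, preds + q

# toy usage with a known regression function as the "model"
rng = np.random.default_rng(0)
x = rng.normal(size=200); y = 2 * x + rng.normal(scale=0.5, size=200)
lo, hi = split_conformal_intervals(lambda z: 2 * z, x[:100], y[:100], x[100:])
print(np.mean((y[100:] >= lo) & (y[100:] <= hi)))    # empirical coverage ~ 0.9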