Any6D: Model-free 6D Pose Estimation of Novel Objects
Lee, Taeyeop, Wen, Bowen, Kang, Minjun, Kang, Gyuree, Kweon, In So, Yoon, Kuk-Jin
We introduce Any6D, a model-free framework for 6D object pose estimation that requires only a single RGB-D anchor image to estimate both the 6D pose and size of unknown objects in novel scenes. Unlike existing methods that rely on textured 3D models or multiple viewpoints, Any6D leverages a joint object alignment process to enhance 2D-3D alignment and metric scale estimation for improved pose accuracy. Our approach integrates a render-and-compare strategy to generate and refine pose hypotheses, enabling robust performance in scenarios with occlusions, non-overlapping views, diverse lighting conditions, and large cross-environment variations. We evaluate our method on five challenging datasets: REAL275, Toyota-Light, HO3D, YCBINEOAT, and LM-O, where it significantly outperforms state-of-the-art methods for novel object pose estimation. Project page: https://taeyeop.com/any6d
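The render-and-compare strategy reduces to a simple loop. Below is a minimal, illustrative sketch (not Any6D's actual code): it assumes a hypothetical render function that produces a synthetic RGB-D view of the object at a candidate pose, and scores each hypothesis by its discrepancy against the observation.

```python
import numpy as np

def render(mesh, pose, cam_K):
    """Hypothetical renderer: returns an (H, W, 4) RGB-D image of `mesh`
    posed at the 4x4 transform `pose` under intrinsics `cam_K` (stubbed)."""
    return np.zeros((64, 64, 4))

def score_hypothesis(observed_rgbd, rendered_rgbd):
    # Lower photometric + depth discrepancy -> better hypothesis.
    return -np.mean((observed_rgbd - rendered_rgbd) ** 2)

def render_and_compare(mesh, observed_rgbd, cam_K, pose_hypotheses):
    """Return the pose whose rendering best matches the observed RGB-D crop."""
    scores = [score_hypothesis(observed_rgbd, render(mesh, pose, cam_K))
              for pose in pose_hypotheses]
    return pose_hypotheses[int(np.argmax(scores))]
```

In practice such pipelines typically learn the scoring function and iteratively refine the top hypotheses rather than selecting once, but the control flow is the same.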
FoundationStereo: Zero-Shot Stereo Matching
Wen, Bowen, Trepte, Matthew, Aribido, Joseph, Kautz, Jan, Gallo, Orazio, Birchfield, Stan
Deep stereo matching has made tremendous progress on benchmark datasets through per-domain fine-tuning. However, achieving strong zero-shot generalization - a hallmark of foundation models in other computer vision tasks - remains challenging for stereo matching. We introduce FoundationStereo, a foundation model for stereo depth estimation designed to achieve strong zero-shot generalization. To this end, we first construct a large-scale (1M stereo pairs) synthetic training dataset featuring large diversity and high photorealism, followed by an automatic self-curation pipeline to remove ambiguous samples. We then design a number of network architecture components to enhance scalability, including a side-tuning feature backbone that adapts rich monocular priors from vision foundation models to mitigate the sim-to-real gap, and long-range context reasoning for effective cost volume filtering. Together, these components lead to strong robustness and accuracy across domains, establishing a new standard in zero-shot stereo depth estimation. Project page: https://nvlabs.github.io/FoundationStereo/
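For readers unfamiliar with cost volumes, the sketch below shows the standard correlation cost volume that such filtering operates on. It is a generic stereo building block, not FoundationStereo's specific architecture, and all names are illustrative.

```python
import numpy as np

def correlation_cost_volume(feat_left, feat_right, max_disp):
    """Build a (max_disp, H, W) correlation cost volume from (C, H, W) features.

    Generic stereo construction, not FoundationStereo's architecture: for each
    candidate disparity d, correlate left features with right features
    shifted by d pixels.
    """
    C, H, W = feat_left.shape
    volume = np.zeros((max_disp, H, W), dtype=feat_left.dtype)
    volume[0] = (feat_left * feat_right).mean(axis=0)
    for d in range(1, max_disp):
        volume[d, :, d:] = (feat_left[:, :, d:] * feat_right[:, :, :-d]).mean(axis=0)
    return volume

# After filtering, disparity can be read off the volume, e.g. volume.argmax(axis=0).
```

Turning these raw correlations into clean disparities is exactly what the paper's long-range context reasoning for cost volume filtering is designed to do.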
SPOT: SE(3) Pose Trajectory Diffusion for Object-Centric Manipulation
Hsu, Cheng-Chun, Wen, Bowen, Xu, Jie, Narang, Yashraj, Wang, Xiaolong, Zhu, Yuke, Biswas, Joydeep, Birchfield, Stan
We introduce SPOT, an object-centric imitation learning framework. The key idea is to capture each task with an object-centric representation: the SE(3) object pose trajectory relative to the target. This approach decouples embodiment actions from sensory inputs, facilitating learning from various demonstration types, including both action-based and action-less human hand demonstrations, as well as cross-embodiment generalization. Additionally, object pose trajectories inherently capture planning constraints from demonstrations without the need for manually crafted rules. To guide the robot in executing the task, the object trajectory is used to condition a diffusion policy. We show improved performance over prior work on simulated RLBench tasks. In real-world evaluation, using only eight demonstrations captured with an iPhone, our approach completed all tasks while fully complying with task constraints. Project page: https://nvlabs.github.io/object_centric_diffusion
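Concretely, the object-centric representation amounts to expressing each object pose in the target's frame. A minimal sketch, assuming 4x4 homogeneous matrices; the function name is illustrative:

```python
import numpy as np

def relative_pose_trajectory(object_poses, target_pose):
    """Express each object pose in the target's frame: T_rel = T_target^-1 T_obj.

    Minimal sketch with 4x4 homogeneous matrices; this relative SE(3)
    trajectory is the kind of object-centric signal a diffusion policy
    can be conditioned on.
    """
    target_inv = np.linalg.inv(target_pose)
    return [target_inv @ T for T in object_poses]
```

Because the trajectory mentions only object and target frames, it is agnostic to whichever embodiment (robot arm or human hand) produced the demonstration.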
SkillMimicGen: Automated Demonstration Generation for Efficient Skill Learning and Deployment
Garrett, Caelan, Mandlekar, Ajay, Wen, Bowen, Fox, Dieter
Imitation learning from human demonstrations is an effective paradigm for robot manipulation, but acquiring large datasets is costly and resource-intensive, especially for long-horizon tasks. To address this issue, we propose SkillMimicGen (SkillGen), an automated system for generating demonstration datasets from a few human demos. SkillGen segments human demos into manipulation skills, adapts these skills to new contexts, and stitches them together through free-space transit and transfer motion. We also propose a Hybrid Skill Policy (HSP) framework for learning skill initiation, control, and termination components from SkillGen datasets, enabling skills to be sequenced using motion planning at test time. We demonstrate that SkillGen greatly improves data generation and policy learning performance over a state-of-the-art data generation framework, resulting in the capability to produce data for large scene variations, including clutter, and agents that are on average 24% more successful. We demonstrate the efficacy of SkillGen by generating over 24K demonstrations across 18 task variants in simulation from just 60 human demonstrations, and training proficient, often near-perfect, HSP agents. Finally, we apply SkillGen to 3 real-world manipulation tasks and also demonstrate zero-shot sim-to-real transfer on a long-horizon assembly task. Videos and more at https://skillgen.github.io.
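The segment-then-stitch idea can be illustrated with a toy segmentation pass. The heuristic below, which splits on an assumed per-frame contact flag, is a stand-in for SkillGen's actual segmentation; the `in_contact` key is hypothetical.

```python
def segment_demo(frames, contact_key="in_contact"):
    """Split a demo into alternating "transit" / "skill" segments.

    Toy heuristic only (not SkillGen's segmentation): contiguous runs of
    frames where the assumed boolean `frame[contact_key]` is True become
    manipulation-skill segments; the rest are free-space transit, which a
    SkillGen-style system replaces with planned motion.
    """
    segments, current, label = [], [], None
    for frame in frames:
        kind = "skill" if frame[contact_key] else "transit"
        if kind != label and current:
            segments.append((label, current))
            current = []
        current.append(frame)
        label = kind
    if current:
        segments.append((label, current))
    return segments
```

Transit segments are then replaced with planned free-space motion, while skill segments are adapted to the new scene and replayed.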
AutoMate: Specialist and Generalist Assembly Policies over Diverse Geometries
Tang, Bingjie, Akinola, Iretiayo, Xu, Jie, Wen, Bowen, Handa, Ankur, Van Wyk, Karl, Fox, Dieter, Sukhatme, Gaurav S., Ramos, Fabio, Narang, Yashraj
Robotic assembly for high-mixture settings requires adaptivity to diverse parts and poses, which is an open challenge. Meanwhile, in other areas of robotics, large models and sim-to-real have led to tremendous progress. Inspired by such work, we present AutoMate, a learning framework and system that consists of 4 parts: 1) a dataset of 100 assemblies compatible with simulation and the real world, along with parallelized simulation environments for policy learning, 2) a novel simulation-based approach for learning specialist (i.e., part-specific) policies and generalist (i.e., unified) assembly policies, 3) demonstrations of specialist policies that individually solve 80 assemblies with 80% or higher success rates in simulation, as well as a generalist policy that jointly solves 20 assemblies with an 80%+ success rate, and 4) zero-shot sim-to-real transfer that achieves performance similar to (or better than) simulation, including on perception-initialized assembly. The key methodological takeaway is that a union of diverse algorithms from manufacturing engineering, character animation, and time-series analysis provides a generic and robust solution for a diverse range of robotic assembly problems. To our knowledge, AutoMate provides the first simulation-based framework for learning specialist and generalist policies over a wide range of assemblies, as well as the first system demonstrating zero-shot sim-to-real transfer over such a range.
NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows
Tang, Zhenggang, Ren, Zhongzheng, Zhao, Xiaoming, Wen, Bowen, Tremblay, Jonathan, Birchfield, Stan, Schwing, Alexander
We present a method for automatically modifying a NeRF representation based on a single observation of a non-rigid transformed version of the original scene. Our method defines the transformation as a 3D flow, specifically as a weighted linear blending of rigid transformations of 3D anchor points that are defined on the surface of the scene. In order to identify anchor points, we introduce a novel correspondence algorithm that first matches RGB-based pairs, then leverages multi-view information and 3D reprojection to robustly filter false positives in two steps. We also introduce a new dataset for exploring the problem of modifying a NeRF scene through a single observation. Our dataset (https://github.com/nerfdeformer/nerfdeformer) contains 113 synthetic scenes leveraging 47 3D assets. We show that our proposed method outperforms NeRF editing methods as well as diffusion-based methods, and we also explore different methods for filtering correspondences.
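The transformation model - a weighted linear blending of rigid anchor transformations - is essentially linear blend skinning applied to the scene surface. A minimal sketch follows; the Gaussian distance weights are an assumption for illustration, not necessarily the paper's weighting scheme.

```python
import numpy as np

def blended_flow(points, anchor_points, anchor_transforms, sigma=0.1):
    """Deform (N, 3) `points` by a weighted blend of per-anchor rigid transforms.

    Linear-blend-skinning-style sketch of "weighted linear blending of rigid
    transformations of 3D anchor points"; the Gaussian distance weights below
    are an assumption. `anchor_transforms` is a list of (R, t) pairs, one per
    row of the (K, 3) `anchor_points`.
    """
    d2 = ((points[:, None, :] - anchor_points[None, :, :]) ** 2).sum(-1)  # (N, K)
    w = np.exp(-d2 / (2.0 * sigma**2))
    w = w / w.sum(axis=1, keepdims=True)  # normalized blend weights
    out = np.zeros_like(points)
    for k, (R, t) in enumerate(anchor_transforms):
        out += w[:, [k]] * (points @ R.T + t)  # blend each rigidly moved copy
    return out
```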
Neural Implicit Representation for Building Digital Twins of Unknown Articulated Objects
Weng, Yijia, Wen, Bowen, Tremblay, Jonathan, Blukis, Valts, Fox, Dieter, Guibas, Leonidas, Birchfield, Stan
We address the problem of building digital twins of unknown articulated objects from two RGBD scans of the object at different articulation states. We decompose the problem into two stages, each addressing distinct aspects. Our method first reconstructs object-level shape at each state, then recovers the underlying articulation model including part segmentation and joint articulations that associate the two states. By explicitly modeling point-level correspondences and exploiting cues from images, 3D reconstructions, and kinematics, our method yields more accurate and stable results compared to prior work. It also handles more than one movable part and does not rely on any object shape or structure priors. Project page: https://github.com/NVlabs/DigitalTwinArt
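Once part-level correspondences between the two states are available, joint parameters for a revolute part follow from a standard screw decomposition of the part's relative rigid motion. A textbook sketch (not the paper's optimization), assuming the relative rotation R and translation t have already been estimated:

```python
import numpy as np

def revolute_joint_from_relative_pose(R, t):
    """Recover a revolute joint's axis, angle, and a point on the axis from a
    part's relative rigid motion (R, t) between the two articulation states.

    Textbook screw decomposition, assuming a non-degenerate rotation
    (angle not near 0 or pi); not the paper's actual optimization.
    """
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    axis = axis / (2.0 * np.sin(angle))
    # A point p on the axis satisfies (I - R) p = t with t's component along
    # the axis removed; (I - R) is singular along the axis, so use least squares.
    t_perp = t - axis * np.dot(axis, t)
    p = np.linalg.lstsq(np.eye(3) - R, t_perp, rcond=None)[0]
    return axis, angle, p
```

Prismatic joints are simpler still: R is near identity and t itself gives the sliding direction.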
Localization and Perception for Control of a Low Speed Autonomous Shuttle in a Campus Pilot Deployment
Wen, Bowen
Future SAE Level 4 and Level 5 autonomous vehicles will require novel applications of localization, perception, control, and artificial intelligence technology in order to offer innovative and disruptive solutions to current mobility problems. Accurate localization is essential for self-driving vehicle navigation in GPS-inaccessible environments. This thesis concentrates on low-speed autonomous shuttles mainly utilized for university campus intelligent transportation systems, and presents initial results of ongoing work on solutions to the localization and perception challenges of a planned university pilot deployment. It treats autonomous driving with real-time kinematic (RTK) GPS (Global Positioning System) combined with an inertial measurement unit (IMU), alongside simultaneous localization and mapping (SLAM) with a three-dimensional light detection and ranging (LIDAR) sensor, which provides solutions for scenarios where GPS is unavailable or where a lower-cost, and hence lower-accuracy, GPS is desirable. The in-house automated low-speed electric vehicle from the Automated Driving Lab is used for experimental evaluation and verification. An improved version of Hector SLAM was implemented on ROS and compared with a high-resolution GPS-aided localization framework on the same hardware architecture. The overall configuration, which combines ROS with a dSPACE controller, is an easily transplantable prototype for similar future research on other hardware architectures. The real-world experiments reported here were conducted in a small test area close to the Ohio State University AV pilot test route.
FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
Wen, Bowen, Yang, Wei, Kautz, Jan, Birchfield, Stan
We present FoundationPose, a unified foundation model for 6D object pose estimation and tracking, supporting both model-based and model-free setups. Our approach can be instantly applied at test time to a novel object without fine-tuning, as long as its CAD model is given or a small number of reference images are captured. We bridge the gap between these two setups with a neural implicit representation that allows for effective novel view synthesis, keeping the downstream pose estimation modules invariant under the same unified framework. Strong generalizability is achieved via large-scale synthetic training, aided by a large language model (LLM), a novel transformer-based architecture, and a contrastive learning formulation. Extensive evaluation on multiple public datasets involving challenging scenarios and objects indicates that our unified approach outperforms existing methods specialized for each task by a large margin. In addition, it even achieves results comparable to instance-level methods despite the reduced assumptions. Project page: https://nvlabs.github.io/FoundationPose/
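Conceptually, the unified setup exposes a single estimate-then-track interface regardless of whether a CAD model or a few reference images are supplied. The class below is an interface sketch only, with method bodies elided; it is not the released API, which is linked from the project page.

```python
# Illustrative interface sketch: these class and method names are hypothetical,
# not the API released at https://nvlabs.github.io/FoundationPose/.

class UnifiedPoseEstimator:
    """One estimator covering model-based and model-free setups."""

    def __init__(self, cad_model=None, reference_images=None):
        if cad_model is None and reference_images is None:
            raise ValueError("need a CAD model or a few reference images")
        # Model-free setup: a neural implicit representation built from the
        # reference images stands in for the CAD model via novel view
        # synthesis, so the downstream pose modules stay unchanged.
        self.model = cad_model if cad_model is not None else self._reconstruct(reference_images)

    def _reconstruct(self, reference_images):
        ...  # build neural implicit field (elided)

    def register(self, rgbd, mask):
        ...  # global pose estimation on the first frame (elided)

    def track(self, rgbd, prev_pose):
        ...  # frame-to-frame pose refinement (elided)

def track_video(estimator, frames, first_mask):
    """Register on the first frame, then track the rest."""
    poses = [estimator.register(frames[0], first_mask)]
    for rgbd in frames[1:]:
        poses.append(estimator.track(rgbd, poses[-1]))
    return poses
```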
MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations
Mandlekar, Ajay, Nasiriany, Soroush, Wen, Bowen, Akinola, Iretiayo, Narang, Yashraj, Fan, Linxi, Zhu, Yuke, Fox, Dieter
Imitation learning from a large set of human demonstrations has proved to be an effective paradigm for building capable robot agents. However, the demonstrations can be extremely costly and time-consuming to collect. We introduce MimicGen, a system for automatically synthesizing large-scale, rich datasets from only a small number of human demonstrations by adapting them to new contexts. We use MimicGen to generate over 50K demonstrations across 18 tasks with diverse scene configurations, object instances, and robot arms from just ~200 human demonstrations. We show that robot agents can be effectively trained on this generated dataset by imitation learning to achieve strong performance in long-horizon and high-precision tasks, such as multi-part assembly and coffee preparation, across broad initial state distributions. We further demonstrate that the effectiveness and utility of MimicGen data compare favorably to collecting additional human demonstrations, making it a powerful and economical approach towards scaling up robot learning. Datasets, simulation environments, videos, and more at https://mimicgen.github.io.
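The core adaptation step keeps each end-effector pose fixed relative to the manipulated object and re-expresses it under the object's new pose. A minimal sketch with 4x4 homogeneous matrices; the names are illustrative rather than MimicGen's released code.

```python
import numpy as np

def adapt_segment(ee_poses, obj_pose_src, obj_pose_new):
    """Re-target an object-centric demo segment to a new object pose.

    Sketch of the MimicGen-style adaptation: each end-effector pose is kept
    fixed relative to the object, so
        T_ee_new = T_obj_new @ inv(T_obj_src) @ T_ee_src.
    All arguments are 4x4 homogeneous matrices (lists thereof for `ee_poses`).
    """
    T = obj_pose_new @ np.linalg.inv(obj_pose_src)
    return [T @ T_ee for T_ee in ee_poses]
```

The system then connects the robot's current configuration to the start of the transformed segment (e.g., by interpolation) to produce a complete new demonstration.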