Yen-Chen, Lin
Cosmos World Foundation Model Platform for Physical AI
NVIDIA, null, :, null, Agarwal, Niket, Ali, Arslan, Bala, Maciej, Balaji, Yogesh, Barker, Erik, Cai, Tiffany, Chattopadhyay, Prithvijit, Chen, Yongxin, Cui, Yin, Ding, Yifan, Dworakowski, Daniel, Fan, Jiaojiao, Fenzi, Michele, Ferroni, Francesco, Fidler, Sanja, Fox, Dieter, Ge, Songwei, Ge, Yunhao, Gu, Jinwei, Gururani, Siddharth, He, Ethan, Huang, Jiahui, Huffman, Jacob, Jannaty, Pooya, Jin, Jingyi, Kim, Seung Wook, Klár, Gergely, Lam, Grace, Lan, Shiyi, Leal-Taixe, Laura, Li, Anqi, Li, Zhaoshuo, Lin, Chen-Hsuan, Lin, Tsung-Yi, Ling, Huan, Liu, Ming-Yu, Liu, Xian, Luo, Alice, Ma, Qianli, Mao, Hanzi, Mo, Kaichun, Mousavian, Arsalan, Nah, Seungjun, Niverty, Sriharsha, Page, David, Paschalidou, Despoina, Patel, Zeeshan, Pavao, Lindsey, Ramezanali, Morteza, Reda, Fitsum, Ren, Xiaowei, Sabavat, Vasanth Rao Naik, Schmerling, Ed, Shi, Stella, Stefaniak, Bartosz, Tang, Shitao, Tchapmi, Lyne, Tredak, Przemek, Tseng, Wei-Cheng, Varghese, Jibin, Wang, Hao, Wang, Haoxiang, Wang, Heng, Wang, Ting-Chun, Wei, Fangyin, Wei, Xinyue, Wu, Jay Zhangjie, Xu, Jiashu, Yang, Wei, Yen-Chen, Lin, Zeng, Xiaohui, Zeng, Yu, Zhang, Jing, Zhang, Qinsheng, Zhang, Yuxuan, Zhao, Qingqing, Zolkowski, Artur
Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via https://github.com/NVIDIA/Cosmos.
Shelving, Stacking, Hanging: Relational Pose Diffusion for Multi-modal Rearrangement
Simeonov, Anthony, Goyal, Ankit, Manuelli, Lucas, Yen-Chen, Lin, Sarmiento, Alina, Rodriguez, Alberto, Agrawal, Pulkit, Fox, Dieter
We propose a system for rearranging objects in a scene to achieve a desired object-scene placing relationship, such as a book inserted in an open slot of a bookshelf. The pipeline generalizes to novel geometries, poses, and layouts of both scenes and objects, and is trained from demonstrations to operate directly on 3D point clouds. Our system overcomes challenges associated with the existence of many geometrically-similar rearrangement solutions for a given scene. By leveraging an iterative pose de-noising training procedure, we can fit multi-modal demonstration data and produce multi-modal outputs while remaining precise and accurate. We also show the advantages of conditioning on relevant local geometric features while ignoring irrelevant global structure that harms both generalization and precision. We demonstrate our approach on three distinct rearrangement tasks that require handling multi-modality and generalization over object shape and pose in both simulation and the real world. Project website, code, and videos: https://anthonysimeonov.github.io/rpdiff-multi-modal/
MIRA: Mental Imagery for Robotic Affordances
Yen-Chen, Lin, Florence, Pete, Zeng, Andy, Barron, Jonathan T., Du, Yilun, Ma, Wei-Chiu, Simeonov, Anthony, Garcia, Alberto Rodriguez, Isola, Phillip
Humans form mental images of 3D scenes to support counterfactual imagination, planning, and motor control. Our abilities to predict the appearance and affordance of the scene from previously unobserved viewpoints aid us in performing manipulation tasks (e.g., 6-DoF kitting) with a level of ease that is currently out of reach for existing robot learning frameworks. In this work, we aim to build artificial systems that can analogously plan actions on top of imagined images. To this end, we introduce Mental Imagery for Robotic Affordances (MIRA), an action reasoning framework that optimizes actions with novel-view synthesis and affordance prediction in the loop. Given a set of 2D RGB images, MIRA builds a consistent 3D scene representation, through which we synthesize novel orthographic views amenable to pixel-wise affordances prediction for action optimization. We illustrate how this optimization process enables us to generalize to unseen out-of-plane rotations for 6-DoF robotic manipulation tasks given a limited number of demonstrations, paving the way toward machines that autonomously learn to understand the world around them for planning actions.
SE(3)-Equivariant Relational Rearrangement with Neural Descriptor Fields
Simeonov, Anthony, Du, Yilun, Yen-Chen, Lin, Rodriguez, Alberto, Kaelbling, Leslie Pack, Lozano-Perez, Tomas, Agrawal, Pulkit
We present a method for performing tasks involving spatial relations between novel object instances initialized in arbitrary poses directly from point cloud observations. Our framework provides a scalable way for specifying new tasks using only 5-10 demonstrations. Object rearrangement is formalized as the question of finding actions that configure task-relevant parts of the object in a desired alignment. This formalism is implemented in three steps: assigning a consistent local coordinate frame to the task-relevant object parts, determining the location and orientation of this coordinate frame on unseen object instances, and executing an action that brings these frames into the desired alignment. We overcome the key technical challenge of determining task-relevant local coordinate frames from a few demonstrations by developing an optimization method based on Neural Descriptor Fields (NDFs) and a single annotated 3D keypoint. An energy-based learning scheme to model the joint configuration of the objects that satisfies a desired relational task further improves performance. The method is tested on three multi-object rearrangement tasks in simulation and on a real robot. Project website, videos, and code: https://anthonysimeonov.github.io/r-ndf/
Debiased Contrastive Learning
Chuang, Ching-Yao, Robinson, Joshua, Yen-Chen, Lin, Torralba, Antonio, Jegelka, Stefanie
A prominent technique for self-supervised representation learning has been to contrast semantically similar and dissimilar pairs of samples. Without access to labels, dissimilar (negative) points are typically taken to be randomly sampled datapoints, implicitly accepting that these points may, in reality, actually have the same label. Perhaps unsurprisingly, we observe that sampling negative examples from truly different labels improves performance, in a synthetic setting where labels are available. Motivated by this observation, we develop a debiased contrastive objective that corrects for the sampling of same-label datapoints, even without knowledge of the true labels. Empirically, the proposed objective consistently outperforms the state-of-the-art for representation learning in vision, language, and reinforcement learning benchmarks. Theoretically, we establish generalization bounds for the downstream classification task.