Chen, Mike
Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control
NVIDIA: Alhaija, Hassan Abu, Alvarez, Jose, Bala, Maciej, Cai, Tiffany, Cao, Tianshi, Cha, Liz, Chen, Joshua, Chen, Mike, Ferroni, Francesco, Fidler, Sanja, Fox, Dieter, Ge, Yunhao, Gu, Jinwei, Hassani, Ali, Isaev, Michael, Jannaty, Pooya, Lan, Shiyi, Lasser, Tobias, Ling, Huan, Liu, Ming-Yu, Liu, Xian, Lu, Yifan, Luo, Alice, Ma, Qianli, Mao, Hanzi, Ramos, Fabio, Ren, Xuanchi, Shen, Tianchang, Tang, Shitao, Wang, Ting-Chun, Wu, Jay, Xu, Jiashu, Xu, Stella, Xie, Kevin, Ye, Yuchong, Yang, Xiaodong, Zeng, Xiaohui, Zeng, Yu
We introduce Cosmos-Transfer1, a conditional world generation model that generates world simulations from multiple spatial control inputs of various modalities, such as segmentation, depth, and edge maps. The spatial conditioning scheme is adaptive and customizable: different conditional inputs can be weighted differently at different spatial locations. This enables highly controllable world generation and supports a range of world-to-world transfer use cases, including Sim2Real. We conduct extensive evaluations of the proposed model and demonstrate its applications for Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment. We further demonstrate an inference scaling strategy that achieves real-time world generation on an NVIDIA GB200 NVL72 rack.
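The adaptive spatial conditioning can be pictured as a per-pixel weighted blend of the control branches. Below is a minimal sketch of that idea in PyTorch, not the Cosmos-Transfer1 implementation; the `blend_controls` helper, the tensor shapes, and the per-pixel normalization are assumptions introduced for illustration.

```python
# Minimal sketch (not the authors' code) of spatially adaptive blending of
# multimodal control signals: each modality branch (segmentation, depth,
# edge) yields a feature map, and a per-modality spatial weight map decides
# how strongly it influences each location.
import torch

def blend_controls(control_feats: dict, weight_maps: dict, eps: float = 1e-6) -> torch.Tensor:
    """control_feats[m]: (B, C, H, W) features from control branch m.
    weight_maps[m]: (B, 1, H, W) non-negative spatial weights for branch m.
    Returns the per-pixel weighted combination, normalized over modalities."""
    total_w = sum(weight_maps[m] for m in control_feats) + eps
    blended = sum(weight_maps[m] * control_feats[m] for m in control_feats)
    return blended / total_w

# Example: emphasize depth on the left half of the frame and edges on the right.
B, C, H, W = 1, 64, 32, 32
feats = {m: torch.randn(B, C, H, W) for m in ("seg", "depth", "edge")}
weights = {m: torch.ones(B, 1, H, W) for m in feats}
weights["depth"][..., : W // 2] = 2.0
weights["edge"][..., W // 2 :] = 2.0
out = blend_controls(feats, weights)   # (1, 64, 32, 32)
```

Setting a modality's weight map to zero over a region switches that control signal off there, which is what makes the conditioning customizable per spatial location.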
InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models
Lu, Yifan, Ren, Xuanchi, Yang, Jiawei, Shen, Tianchang, Wu, Zhangjie, Gao, Jun, Wang, Yue, Chen, Siheng, Chen, Mike, Fidler, Sanja, Huang, Jiahui
Generating simulatable and controllable 3D scenes is an essential task for a wide spectrum of applications, including mixed reality, robotics, and the training and testing of autonomous vehicles (AV) [25, 33]. In particular, the requirements of AV applications have introduced new challenges for 3D generative models in driving scenarios, posing the following key desiderata: (1) fidelity and consistency, to ensure that the generated scenes support photo-realistic rendering while preserving consistent appearance and geometry for reliable and stable physics simulation; (2) large-scale, to generate scenes at map-level for traffic simulation; and (3) controllability, to allow easy manipulation of the scene layout, appearance, and ego-car behaviors for curating adversarial scenarios.

Previous methods for scene generation either suffer from limited scales or lack geometric and appearance consistency along generated sequences. In contrast, we leverage the recent advancements in scalable 3D representation and video models to achieve large dynamic scene generation that allows flexible controls through HD maps, vehicle bounding boxes, and text descriptions. First, we construct a map-conditioned sparse-voxel-based 3D generative model to unleash its power for unbounded voxel world generation. Then, we re-purpose a video model and ground it on the voxel world through a set of carefully designed pixel-aligned guidance buffers, synthesizing a consistent appearance. Finally, we propose a fast feed-forward approach that employs both voxel and pixel branches to lift the dynamic videos to dynamic 3D Gaussians with controllable objects.
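The abstract grounds a video model on the generated voxel world through pixel-aligned guidance buffers. As a rough, hypothetical illustration of what pixel alignment could involve (not the paper's buffers or code), the sketch below projects semantic sparse-voxel centers into a pinhole camera and keeps the nearest voxel per pixel; the function name, arguments, and buffer contents are assumptions.

```python
# Hypothetical sketch of building a pixel-aligned guidance buffer by
# projecting semantic voxel centers into a pinhole camera; the closest
# voxel along each ray wins (simple z-buffer).
import numpy as np

def project_voxels_to_buffer(voxel_centers, voxel_labels, K, world_to_cam, hw):
    """voxel_centers: (N, 3) world-space centers of occupied sparse voxels.
    voxel_labels: (N,) integer semantic class per voxel.
    K: (3, 3) camera intrinsics; world_to_cam: (4, 4) camera extrinsics.
    hw: (H, W) output size. Returns an (H, W) label buffer (-1 = empty)."""
    H, W = hw
    homo = np.concatenate([voxel_centers, np.ones((len(voxel_centers), 1))], axis=1)
    cam = (world_to_cam @ homo.T).T[:, :3]            # points in camera frame
    in_front = cam[:, 2] > 1e-3
    cam, labels = cam[in_front], voxel_labels[in_front]
    pix = (K @ cam.T).T
    uv = (pix[:, :2] / pix[:, 2:3]).astype(int)       # perspective divide
    buf = np.full((H, W), -1, dtype=np.int64)
    zbuf = np.full((H, W), np.inf)
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    for (u, v), z, lab in zip(uv[valid], cam[valid, 2], labels[valid]):
        if z < zbuf[v, u]:                            # keep the closest voxel
            zbuf[v, u], buf[v, u] = z, lab
    return buf
```

A buffer like this, rendered per frame from the ego-camera pose, is one way a video model could be kept geometrically consistent with the underlying voxel world.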
SCube: Instant Large-Scale Scene Reconstruction using VoxSplats
Ren, Xuanchi, Lu, Yifan, Liang, Hanxue, Wu, Zhangjie, Ling, Huan, Chen, Mike, Fidler, Sanja, Williams, Francis, Huang, Jiahui
We present SCube, a novel method for reconstructing large-scale 3D scenes (geometry, appearance, and semantics) from a sparse set of posed images. Our method encodes reconstructed scenes using a novel representation VoxSplat, which is a set of 3D Gaussians supported on a high-resolution sparse-voxel scaffold. To reconstruct a VoxSplat from images, we employ a hierarchical voxel latent diffusion model conditioned on the input images followed by a feedforward appearance prediction model. The diffusion model generates high-resolution grids progressively in a coarse-to-fine manner, and the appearance network predicts a set of Gaussians within each voxel. From as few as 3 non-overlapping input images, SCube can generate millions of Gaussians with a 1024^3 voxel grid spanning hundreds of meters in 20 seconds. Past works tackling scene reconstruction from images either rely on per-scene optimization and fail to reconstruct the scene away from input views (thus requiring dense view coverage as input) or leverage geometric priors based on low-resolution models, which produce blurry results. In contrast, SCube leverages high-resolution sparse networks and produces sharp outputs from few views. We show the superiority of SCube compared to prior art using the Waymo self-driving dataset on 3D reconstruction and demonstrate its applications, such as LiDAR simulation and text-to-scene generation.
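The VoxSplat representation is described as a set of 3D Gaussians supported on a high-resolution sparse-voxel scaffold, with an appearance network predicting the Gaussians inside each voxel. The container below is a minimal sketch of one plausible layout for such a structure; the class name, field set, and local-offset parameterization are assumptions, not SCube's actual code.

```python
# Minimal sketch (an assumption, not SCube's data structure): each occupied
# voxel of a sparse grid holds K Gaussians, stored as offsets local to the
# voxel plus standard 3D Gaussian appearance parameters.
from dataclasses import dataclass
import numpy as np

@dataclass
class VoxSplatSketch:
    voxel_ijk: np.ndarray   # (V, 3) integer coordinates of occupied voxels
    voxel_size: float       # edge length of one voxel in meters
    offsets: np.ndarray     # (V, K, 3) Gaussian centers in [0, 1]^3, local to each voxel
    scales: np.ndarray      # (V, K, 3) per-axis Gaussian extents
    rotations: np.ndarray   # (V, K, 4) unit quaternions
    colors: np.ndarray      # (V, K, 3) RGB
    opacities: np.ndarray   # (V, K)

    def gaussian_centers_world(self) -> np.ndarray:
        """Convert local offsets to world-space centers, shape (V*K, 3)."""
        corners = self.voxel_ijk[:, None, :] * self.voxel_size   # (V, 1, 3)
        return (corners + self.offsets * self.voxel_size).reshape(-1, 3)

# Example: 2 occupied voxels, 4 Gaussians each.
V, K = 2, 4
vs = VoxSplatSketch(
    voxel_ijk=np.array([[10, 20, 30], [10, 21, 30]]),
    voxel_size=0.5,
    offsets=np.random.rand(V, K, 3),
    scales=np.full((V, K, 3), 0.1),
    rotations=np.tile([1.0, 0.0, 0.0, 0.0], (V, K, 1)),
    colors=np.random.rand(V, K, 3),
    opacities=np.ones((V, K)),
)
print(vs.gaussian_centers_world().shape)   # (8, 3)
```

Anchoring Gaussians to occupied voxels keeps the number of primitives proportional to occupied space, which is one reason a sparse 1024^3 grid can scale to scenes spanning hundreds of meters.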
Changing the Narrative Perspective: From Deictic to Anaphoric Point of View
Chen, Mike, Bunescu, Razvan
We introduce the task of changing the narrative point of view, where characters are assigned a narrative perspective that is different from the one originally used by the writer. The resulting shift in the narrative point of view alters the reading experience and can be used as a tool in fiction writing or to generate types of text ranging from educational to self-help and self-diagnosis. We introduce a benchmark dataset containing a wide range of types of narratives annotated with changes in point of view from deictic (first or second person) to anaphoric (third person) and describe a pipeline for processing raw text that relies on a neural architecture for mention selection. Evaluations on the new benchmark dataset show that the proposed architecture substantially outperforms the baselines by generating mentions that are less ambiguous and more natural.
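The core decision in such a rewrite is, for each deictic mention, whether to generate a pronoun or the character's name so that the third-person text stays unambiguous; that is what the paper's neural mention-selection architecture learns. The toy below illustrates only the surface transformation, with a crude periodic rule standing in for learned mention selection; the pronoun table, the character name, and the name_every heuristic are assumptions introduced for illustration.

```python
# Toy illustration (not the paper's neural pipeline) of rewriting a deictic
# (first-person) narrative into an anaphoric (third-person) one: first-person
# pronouns are mapped to third-person forms, and a proper name is periodically
# substituted to keep references unambiguous.
import re

PRONOUN_MAP = {"i": "she", "me": "her", "my": "her", "mine": "hers", "myself": "herself"}

def deictic_to_anaphoric(text: str, name: str = "Maya", name_every: int = 3) -> str:
    count = 0
    def repl(match: re.Match) -> str:
        nonlocal count
        word = match.group(0)
        count += 1
        # Crude stand-in for learned mention selection: roughly every few
        # mentions, replace "I" with the proper name instead of a pronoun.
        out = name if (word.lower() == "i" and count % name_every == 1) else PRONOUN_MAP[word.lower()]
        return out.capitalize() if word[0].isupper() and out != name else out
    return re.sub(r"\b(i|me|my|mine|myself)\b", repl, text, flags=re.IGNORECASE)

print(deictic_to_anaphoric("I told myself that my plan would work."))
# -> "Maya told herself that her plan would work."
```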