GenXD: Generating Any 3D and 4D Scenes

Zhao, Yuyang, Lin, Chung-Ching, Lin, Kevin, Yan, Zhiwen, Li, Linjie, Yang, Zhengyuan, Wang, Jianfeng, Lee, Gim Hee, Wang, Lijuan

arXiv.org Artificial Intelligence 

Figure 1: GenXD is a unified model for high-quality 3D and 4D generation from any number of condition images. By controlling the motion strength and condition masks, GenXD can support various applications without any modification. The condition images are marked with a star icon, and the time dimension is illustrated with dashed lines.

Recent developments in 2D visual generation have been remarkably successful. However, 3D and 4D generation remain challenging in real-world applications due to the lack of large-scale 4D data and effective model design. In this paper, we propose to jointly investigate general 3D and 4D generation by leveraging camera and object movements commonly observed in daily life. Because real-world 4D data is scarce in the community, we first propose a data curation pipeline to obtain camera poses and object motion strength from videos. Based on this pipeline, we introduce a large-scale real-world 4D scene dataset: CamVid-30K. By leveraging all the 3D and 4D data, we develop our framework, GenXD, which allows us to produce any 3D or 4D scene. We propose multiview-temporal modules, which disentangle camera and object movements, to seamlessly learn from both 3D and 4D data. Additionally, GenXD employs masked latent conditions to support a variety of conditioning views. We perform extensive evaluations across various real-world and synthetic datasets, demonstrating GenXD's effectiveness and versatility compared to previous methods in 3D and 4D generation. The dataset and code will be made publicly available.
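To make the idea of masked latent conditions more concrete, the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes each view is represented by a flat latent vector, that condition views keep their latents while all other views are zeroed, and that a per-view binary mask tells the model which views are given. The function name `build_masked_condition` and its parameters are hypothetical.

```python
from typing import List, Sequence, Tuple


def build_masked_condition(
    latents: Sequence[Sequence[float]],
    cond_views: Sequence[int],
) -> Tuple[List[List[float]], List[int]]:
    """Hypothetical helper: keep latents of condition views, zero the rest.

    Returns the masked latents and a binary mask with one entry per view
    (1 = condition view, 0 = view to be generated).
    """
    cond = set(cond_views)
    for idx in cond:
        if not 0 <= idx < len(latents):
            raise IndexError(f"condition view {idx} is out of range")

    # One mask entry per view, so any number of condition views is supported.
    mask = [1 if i in cond else 0 for i in range(len(latents))]

    # Assumption: non-condition views carry an all-zero placeholder latent.
    masked = [
        list(view) if keep else [0.0] * len(view)
        for view, keep in zip(latents, mask)
    ]
    return masked, mask


# Example: three views, only the first one is a condition image.
example_latents = [[0.5, -1.0], [2.0, 3.0], [4.0, 5.0]]
example_masked, example_mask = build_masked_condition(example_latents, [0])
```

Because the mask is defined per view, the same input format covers single-view, sparse-view, and multi-view conditioning by changing only `cond_views`, which is the flexibility the abstract attributes to masked latent conditions.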