Guo, Yuan-Chen
TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models
Li, Yangguang, Zou, Zi-Xin, Liu, Zexiang, Wang, Dehu, Liang, Yuan, Yu, Zhipeng, Liu, Xingchao, Guo, Yuan-Chen, Liang, Ding, Ouyang, Wanli, Cao, Yan-Pei
Recent advancements in diffusion techniques have propelled image and video generation to unprecedented levels of quality, significantly accelerating the deployment and application of generative AI. However, 3D shape generation technology has so far lagged behind, constrained by limitations in 3D data scale, the complexity of 3D data processing, and insufficient exploration of advanced techniques in the 3D domain. Current approaches to 3D shape generation face substantial challenges in terms of output quality, generalization capability, and alignment with input conditions. We present TripoSG, a new streamlined shape diffusion paradigm capable of generating high-fidelity 3D meshes with precise correspondence to input images. Specifically, we propose: 1) a large-scale rectified flow transformer for 3D shape generation, achieving state-of-the-art fidelity through training on extensive, high-quality data; 2) a hybrid supervised training strategy combining SDF, normal, and eikonal losses for the 3D VAE, achieving high-quality 3D reconstruction performance; and 3) a data-processing pipeline that produces 2 million high-quality 3D samples, highlighting the crucial role of data quality and quantity in training 3D generative models. Through comprehensive experiments, we have validated the effectiveness of each component in our new framework. The seamless integration of these parts has enabled TripoSG to achieve state-of-the-art performance in 3D shape generation. The resulting 3D shapes exhibit enhanced detail due to high-resolution capabilities and demonstrate exceptional fidelity to input images. Moreover, TripoSG demonstrates improved versatility in generating 3D models from diverse image styles and contents, showcasing strong generalization capabilities. To foster progress and innovation in the field of 3D generation, we will make our model publicly available.
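The hybrid supervision mentioned above (SDF, normal, and eikonal losses) can be illustrated with a minimal sketch. This is not the paper's implementation: the decoder is stood in for by an analytic sphere SDF, the gradient is taken by finite differences, and all function names are hypothetical.

```python
import numpy as np

def sphere_sdf(p, r=1.0):
    """Analytic SDF of a sphere -- a stand-in for the VAE decoder's field."""
    return np.linalg.norm(p, axis=-1) - r

def numerical_grad(f, p, eps=1e-4):
    """Central-difference gradient of a scalar field f at points p of shape (N, 3)."""
    g = np.zeros_like(p)
    for i in range(3):
        d = np.zeros(3)
        d[i] = eps
        g[:, i] = (f(p + d) - f(p - d)) / (2 * eps)
    return g

def hybrid_losses(f, pts, gt_sdf, gt_normals):
    """Return the three supervision terms on a batch of sample points."""
    pred = f(pts)
    grad = numerical_grad(f, pts)
    # SDF term: direct regression of signed distance values.
    sdf_loss = np.abs(pred - gt_sdf).mean()
    # Normal term: align the normalized field gradient with ground-truth normals.
    n_hat = grad / np.linalg.norm(grad, axis=-1, keepdims=True)
    normal_loss = (1.0 - (n_hat * gt_normals).sum(-1)).mean()
    # Eikonal term: encourage a unit-norm gradient everywhere (a valid distance field).
    eikonal_loss = ((np.linalg.norm(grad, axis=-1) - 1.0) ** 2).mean()
    return sdf_loss, normal_loss, eikonal_loss
```

For points sampled on the sphere surface, all three terms are near zero, since the analytic SDF is already a perfect distance field with the correct normals.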
TEXGen: a Generative Diffusion Model for Mesh Textures
Yu, Xin, Yuan, Ze, Guo, Yuan-Chen, Liu, Ying-Tian, Liu, JianHui, Li, Yangguang, Cao, Yan-Pei, Liang, Ding, Qi, Xiaojuan
While high-quality texture maps are essential for realistic 3D asset rendering, few studies have explored learning directly in the texture space, especially on large-scale datasets. In this work, we depart from the conventional approach of relying on pre-trained 2D diffusion models for test-time optimization of 3D textures. Instead, we focus on the fundamental problem of learning in the UV texture space itself. For the first time, we train a large diffusion model capable of directly generating high-resolution texture maps in a feed-forward manner. To facilitate efficient learning in high-resolution UV spaces, we propose a scalable network architecture that interleaves convolutions on UV maps with attention layers on point clouds. Leveraging this architectural design, we train a 700 million parameter diffusion model that can generate UV texture maps guided by text prompts and single-view images. Once trained, our model naturally supports various extended applications, including text-guided texture inpainting, sparse-view texture completion, and text-driven texture synthesis. Project page is at http://cvmi-lab.github.io/TEXGen/.
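The interleaved UV-convolution/point-attention design described above can be sketched at a toy scale. This is a simplified illustration, not TEXGen's architecture: the kernel is a fixed smoothing filter, the attention is a single unlearned head over raw 3D positions, and all names are hypothetical.

```python
import numpy as np

def uv_conv(feat, kernel):
    """3x3 convolution (shared across channels) over a UV feature map (H, W, C), zero-padded."""
    H, W, C = feat.shape
    pad = np.pad(feat, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(feat)
    for i in range(3):
        for j in range(3):
            out += pad[i:i + H, j:j + W] * kernel[i, j]
    return out

def point_attention(feat, pts):
    """Self-attention with 3D point positions as queries/keys, so texels that
    are close on the surface (not merely in UV space) exchange information --
    the motivation for interleaving it with UV convolutions."""
    logits = pts @ pts.T / np.sqrt(3)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ feat  # (N, C) attended features

def hybrid_block(uv_feat, uv_to_point):
    """One interleaved block: local mixing in UV space, then global mixing in
    3D space. `uv_to_point` maps each texel to its surface point, shape (H, W, 3)."""
    kernel = np.full((3, 3), 1.0 / 9.0)  # toy smoothing kernel
    x = uv_conv(uv_feat, kernel)
    H, W, C = x.shape
    flat = point_attention(x.reshape(-1, C), uv_to_point.reshape(-1, 3))
    return flat.reshape(H, W, C)
```

The convolution handles local texture detail cheaply at high resolution, while the point-space attention stitches together UV islands that are adjacent on the mesh but distant in the atlas.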
DreamComposer: Controllable 3D Object Generation via Multi-View Conditions
Yang, Yunhan, Huang, Yukun, Wu, Xiaoyang, Guo, Yuan-Chen, Zhang, Song-Hai, Zhao, Hengshuang, He, Tong, Liu, Xihui
Utilizing pre-trained 2D large-scale generative models, recent works are capable of generating high-quality novel views from a single in-the-wild image. However, due to the lack of information from multiple views, these works encounter difficulties in generating controllable novel views. In this paper, we present DreamComposer, a flexible and scalable framework that can enhance existing view-aware diffusion models by injecting multi-view conditions. Specifically, DreamComposer first uses a view-aware 3D lifting module to obtain 3D representations of an object from multiple views. Then, it renders the latent features of the target view from the 3D representations with the multi-view feature fusion module. Finally, the target-view features extracted from multi-view inputs are injected into a pre-trained diffusion model. Experiments show that DreamComposer is compatible with state-of-the-art diffusion models for zero-shot novel view synthesis, further enhancing them to generate high-fidelity novel view images with multi-view conditions, ready for controllable 3D object reconstruction and various other applications.
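The fusion step described above can be caricatured as a weighted combination of per-view features. This is a deliberately minimal stand-in for the multi-view feature fusion module, not the paper's method: it weights each source view by its camera direction's similarity to the target view, and the function name is hypothetical.

```python
import numpy as np

def fuse_multiview(view_feats, view_dirs, target_dir):
    """Fuse per-view latent features (V, C) into a single target-view feature.

    Each source view is softmax-weighted by the cosine similarity between its
    (unit) camera direction and the target camera direction, so views that
    observe the object from near the target pose dominate the fused feature.
    """
    sims = view_dirs @ target_dir                  # (V,) cosine similarities
    w = np.exp(sims - sims.max())
    w /= w.sum()                                   # convex combination weights
    return w @ view_feats                          # (C,)
```

A view aligned with the target direction receives the largest weight, which is the intuition behind conditioning the target view on whichever inputs actually constrain it.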
Text-to-3D with Classifier Score Distillation
Yu, Xin, Guo, Yuan-Chen, Li, Yangguang, Liang, Ding, Zhang, Song-Hai, Qi, Xiaojuan
Creating a high-quality 3D asset remains challenging and expensive, as it requires a high level of expertise. Therefore, automating this process with generative models has become an important problem, which remains challenging due to the scarcity of data and the complexity of 3D representations. Recently, techniques based on Score Distillation Sampling (SDS) (Poole et al., 2022; Lin et al., 2023; Chen et al., 2023; Wang et al., 2023b), also known as Score Jacobian Chaining (SJC) (Wang et al., 2023a), have emerged as a major research direction for text-to-3D generation, as they can produce high-quality and intricate 3D results from diverse text prompts without requiring 3D data for training. The core principle behind SDS is to optimize 3D representations by encouraging their rendered images to move towards high-probability-density regions conditioned on the text, where the supervision is provided by a pre-trained 2D diffusion model (Ho et al., 2020; Sohl-Dickstein et al., 2015; Rombach et al., 2022; Saharia et al., 2022; Balaji et al., 2022). DreamFusion (Poole et al., 2022) advocates the use of SDS for the optimization of Neural Radiance Fields (NeRF).
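The SDS principle described above, and the classifier-score variant the title refers to, can be sketched at the level of their per-step gradients. This is a schematic, not the paper's implementation: it shows only the direction applied to the rendered image, with the timestep weighting collapsed into a single scalar `w`.

```python
import numpy as np

def sds_grad(eps_pred, eps_noise, w=1.0):
    """SDS-style gradient (per DreamFusion): the residual between the diffusion
    model's noise prediction on the rendered image and the sampled noise."""
    return w * (eps_pred - eps_noise)

def csd_grad(eps_cond, eps_uncond, w=1.0):
    """Classifier Score Distillation: keep only the implicit-classifier
    direction, i.e. the difference between the text-conditional and
    unconditional noise predictions (the classifier-free-guidance term),
    dropping the noise residual used by SDS."""
    return w * (eps_cond - eps_uncond)
```

In both cases the gradient is backpropagated through the differentiable renderer into the 3D representation's parameters; the difference lies purely in which score direction supervises the render.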
NeRF-SR: High-Quality Neural Radiance Fields using Super-Sampling
Wang, Chen, Wu, Xian, Guo, Yuan-Chen, Zhang, Song-Hai, Tai, Yu-Wing, Hu, Shi-Min
We present NeRF-SR, a solution for high-resolution (HR) novel view synthesis with mostly low-resolution (LR) inputs. Our method is built upon Neural Radiance Fields (NeRF), which predicts per-point density and color with a multi-layer perceptron. While producing images at arbitrary scales, NeRF struggles with resolutions that go beyond the observed images. Our key insight is that NeRF has a local prior, which means predictions at a 3D point can be propagated to the nearby region and remain accurate. We first exploit it with a super-sampling strategy that shoots multiple rays at each image pixel, which enforces multi-view constraints at a sub-pixel level. Then, we show that NeRF-SR can further boost the performance of super-sampling with a refinement network that leverages the estimated depth at hand to hallucinate details from related patches on an HR reference image. Experimental results demonstrate that NeRF-SR generates high-quality results for novel view synthesis at HR on both synthetic and real-world datasets.
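The super-sampling strategy above can be sketched as follows. This is a minimal illustration of the sampling pattern only, assuming a regular k-by-k sub-pixel grid; `render_fn` is a hypothetical stand-in for tracing a NeRF ray at a given image-plane coordinate.

```python
import numpy as np

def subpixel_rays(i, j, k=2):
    """Return the k*k sub-pixel sample coordinates inside pixel (i, j),
    placed at the centers of a regular k-by-k grid within the pixel."""
    offs = (np.arange(k) + 0.5) / k                # sub-cell centers in [0, 1)
    u, v = np.meshgrid(offs, offs, indexing="ij")
    return np.stack([i + u.ravel(), j + v.ravel()], axis=-1)  # (k*k, 2)

def supersampled_pixel(render_fn, i, j, k=2):
    """Supervise one LR pixel with the average of k*k HR ray renders,
    which is what ties the HR field to the LR observation at sub-pixel level."""
    return render_fn(subpixel_rays(i, j, k)).mean(axis=0)
```

During training, the averaged sub-pixel renders are compared against the observed LR pixel, so the underlying field is constrained at a finer granularity than any single input image provides.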