Deng, Kangle
Depth-supervised NeRF: Fewer Views and Faster Training for Free
Deng, Kangle, Liu, Andrew, Zhu, Jun-Yan, Ramanan, Deva
A commonly observed failure mode of Neural Radiance Fields (NeRF) is fitting incorrect geometry when given an insufficient number of input views. One potential reason is that standard volumetric rendering does not enforce the constraint that most of a scene's geometry consists of empty space and opaque surfaces. We formalize the above assumption through DS-NeRF (Depth-supervised Neural Radiance Fields), a loss for learning radiance fields that takes advantage of readily available depth supervision. We leverage the fact that current NeRF pipelines require images with known camera poses, which are typically estimated by running structure-from-motion (SFM). Crucially, SFM also produces sparse 3D points that can be used as "free" depth supervision during training: we add a loss that encourages the distribution of a ray's terminating depth to match the depth of a given 3D keypoint, while accounting for depth uncertainty. DS-NeRF renders better images given fewer training views while training 2-3x faster. Further, we show that our loss is compatible with other recently proposed NeRF methods, demonstrating that depth is a cheap and easily digestible supervisory signal. Finally, we find that DS-NeRF also supports other forms of depth supervision, such as depth from scanning sensors and RGB-D reconstruction outputs.
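To make the depth term concrete, below is a minimal PyTorch sketch of one way to penalize rays whose termination distribution strays from an SfM keypoint depth; the tensor names (`weights`, `t_vals`, `kp_depth`, `kp_sigma`) and the cross-entropy form are illustrative assumptions, not the paper's exact loss.

```python
import torch

def depth_supervision_loss(weights, t_vals, kp_depth, kp_sigma, eps=1e-8):
    """Encourage each ray's termination distribution to concentrate near an
    SfM keypoint depth (hedged sketch, not the published DS-NeRF loss).

    weights:  (N_rays, N_samples) volume-rendering weights along each ray
    t_vals:   (N_rays, N_samples) sample depths along each ray
    kp_depth: (N_rays,) keypoint depth projected onto each ray
    kp_sigma: (N_rays,) depth uncertainty, e.g. from SFM reprojection error
    """
    # Gaussian target distribution centered on the keypoint depth.
    target = torch.exp(-0.5 * ((t_vals - kp_depth[:, None]) / kp_sigma[:, None]) ** 2)
    target = target / (target.sum(dim=-1, keepdim=True) + eps)
    # Cross-entropy between the target and the rendering weights: rays whose
    # mass terminates far from the keypoint depth are penalized.
    return -(target * torch.log(weights + eps)).sum(dim=-1).mean()
```

In a training loop, a term like this would simply be added to the usual photometric loss, with a small weight, for the rays that have an associated keypoint.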
FlashTex: Fast Relightable Mesh Texturing with LightControlNet
Deng, Kangle, Omernick, Timothy, Weiss, Alexander, Ramanan, Deva, Zhu, Jun-Yan, Zhou, Tinghui, Agrawala, Maneesh
Manually creating textures for 3D meshes is time-consuming, even for expert visual content creators. We propose a fast approach for automatically texturing an input 3D mesh based on a user-provided text prompt. Importantly, our approach disentangles lighting from surface material/reflectance in the resulting texture so that the mesh can be properly relit and rendered in any lighting environment. We introduce LightControlNet, a new text-to-image model based on the ControlNet architecture, which allows the specification of the desired lighting as a conditioning image to the model. Our text-to-texture pipeline then constructs the texture in two stages. The first stage produces a sparse set of visually consistent reference views of the mesh using LightControlNet. The second stage applies a texture optimization based on Score Distillation Sampling (SDS) that works with LightControlNet to improve the texture quality while disentangling surface material from lighting. Our algorithm is significantly faster than previous text-to-texture methods, while producing high-quality and relightable textures.
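As a rough illustration of the second stage, here is a hedged PyTorch sketch of a generic SDS-style update that such a pipeline could run; `denoiser`, `cond`, `alphas_cumprod`, and the weighting are assumptions for illustration and do not reproduce FlashTex's actual optimizer.

```python
import torch

def sds_loss(denoiser, latents, cond, alphas_cumprod, guidance_scale=7.5):
    """One Score Distillation Sampling step (hedged sketch).

    denoiser(x_t, t, cond) is assumed to predict the noise added at timestep t
    (cond=None for the unconditional branch); `latents` come from differentiably
    rendering the textured mesh and encoding the image; `cond` carries the
    text/lighting conditioning.
    """
    b = latents.shape[0]
    t = torch.randint(20, 980, (b,), device=latents.device)      # random diffusion timestep
    noise = torch.randn_like(latents)
    a_t = alphas_cumprod[t].view(b, 1, 1, 1)
    noisy = a_t.sqrt() * latents + (1.0 - a_t).sqrt() * noise     # forward-diffuse the render
    with torch.no_grad():
        eps_cond = denoiser(noisy, t, cond)
        eps_uncond = denoiser(noisy, t, None)
    eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)   # classifier-free guidance
    grad = (1.0 - a_t) * (eps - noise)                            # common SDS weighting
    # Backpropagating this surrogate loss pushes `grad` into the texture
    # parameters through `latents`.
    return (grad.detach() * latents).sum()
```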
StableLego: Stability Analysis of Block Stacking Assembly
Liu, Ruixuan, Deng, Kangle, Wang, Ziwei, Liu, Changliu
Recent advancements in robotics enable robots to accomplish complex assembly tasks. However, designing an assembly requires non-trivial effort, since a slight variation in the design can significantly affect task feasibility. It is therefore critical to ensure the physical feasibility of an assembly design so that the assembly task can be successfully executed. To address this challenge, this paper studies the physical stability of assembly structures, in particular block-stacking assembly, in which cubic blocks are used to build 3D structures (e.g., Lego constructions). The paper proposes a new optimization formulation that solves force-balance equations to infer the structural stability of 3D block-stacking structures. The proposed stability analysis is tested and verified on hand-crafted Lego examples. The experimental results demonstrate that the proposed analysis correctly predicts whether a structure is stable. In addition, it outperforms existing methods: it can locate the weakest parts of a design and, more importantly, handle any given assembly structure. To further validate the proposed formulation, we provide StableLego, a comprehensive dataset of more than 50k 3D objects with their Lego layouts, and include the stability inference computed by the proposed analysis for each object. Our code and the dataset are available at https://github.com/intelligent-control-lab/StableLego.
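To make the force-balance idea concrete, the SciPy sketch below checks static feasibility for a single block resting on two supports; the contact layout, variables, and solver choice are toy assumptions and are far simpler than the paper's formulation over a full Lego structure.

```python
import numpy as np
from scipy.optimize import linprog

def block_on_two_supports_is_stable(x_contacts, x_com, weight=1.0):
    """Toy 2D force-balance feasibility check for one rigid block on two contacts.

    Unknowns: non-negative normal forces f1, f2 at the contact points.
    Constraints: vertical force balance and moment balance about the origin.
    """
    x1, x2 = x_contacts
    A_eq = np.array([[1.0, 1.0],          # f1 + f2 = weight
                     [x1,  x2]])          # f1*x1 + f2*x2 = weight*x_com
    b_eq = np.array([weight, weight * x_com])
    res = linprog(c=[0.0, 0.0], A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None), (0, None)], method="highs")
    return res.success                    # feasible contact forces exist -> statically stable

# The block tips over once its center of mass leaves the support span:
print(block_on_two_supports_is_stable((0.0, 1.0), x_com=0.4))   # True
print(block_on_two_supports_is_stable((0.0, 1.0), x_com=1.3))   # False
```

Solving such a feasibility problem jointly over all blocks and contacts is the kind of optimization the paper's analysis performs; infeasible or barely feasible regions indicate the weakest parts of a design.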
Total-Recon: Deformable Scene Reconstruction for Embodied View Synthesis
Song, Chonghyuk, Yang, Gengshan, Deng, Kangle, Zhu, Jun-Yan, Ramanan, Deva
We explore the task of embodied view synthesis from monocular videos of deformable scenes. Given a minute-long RGBD video of people interacting with their pets, we render the scene from novel camera trajectories derived from the in-scene motion of actors: (1) egocentric cameras that simulate the point of view of a target actor and (2) third-person cameras that follow the actor. Building such a system requires reconstructing the root-body and articulated motion of every actor, as well as a scene representation that supports free-viewpoint synthesis. Longer videos are more likely to capture the scene from diverse viewpoints (which helps reconstruction) but are also more likely to contain larger motions (which complicates reconstruction). To address these challenges, we present Total-Recon, the first method to photorealistically reconstruct deformable scenes from long monocular RGBD videos. Crucially, to scale to long videos, our method hierarchically decomposes the scene into the background and objects, whose motion is further decomposed into carefully initialized root-body motion and local articulations. To quantify such "in-the-wild" reconstruction and view synthesis, we collect ground-truth data with a specialized stereo RGBD capture rig for 11 challenging videos; on this benchmark, our method significantly outperforms prior work. Our code, model, and data can be found at https://andrewsonga.github.io/totalrecon.
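As an illustration of how a follow camera can be derived from recovered actor motion, here is a small NumPy sketch that places a camera behind and above an actor's root-body pose and aims it at the actor; the 4x4 pose convention, forward-axis assumption, and offsets are hypothetical, not Total-Recon's implementation.

```python
import numpy as np

def look_at(eye, target, up=np.array([0.0, 1.0, 0.0])):
    """Build a 4x4 world-from-camera pose that looks from `eye` toward `target`."""
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    pose = np.eye(4)
    pose[:3, 0] = right
    pose[:3, 1] = true_up
    pose[:3, 2] = forward
    pose[:3, 3] = eye
    return pose

def third_person_camera(world_from_actor, behind=1.5, above=0.5):
    """Place a camera behind and above the actor's recovered root-body pose and
    aim it at the actor (assumes the actor's +z axis points forward)."""
    actor_pos = world_from_actor[:3, 3]
    actor_back = -world_from_actor[:3, 2]
    eye = actor_pos + behind * actor_back + np.array([0.0, above, 0.0])
    return look_at(eye, actor_pos)
```

Applying this per frame to the recovered root-body trajectory yields a smooth third-person path; an egocentric camera would instead be a fixed offset rigidly attached to the actor's head or root transform.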
3D-aware Conditional Image Synthesis
Deng, Kangle, Yang, Gengshan, Ramanan, Deva, Zhu, Jun-Yan
We propose pix2pix3D, a 3D-aware conditional generative model for controllable photorealistic image synthesis. Given a 2D label map, such as a segmentation or edge map, our model learns to synthesize a corresponding image from different viewpoints. To enable explicit 3D user control, we extend conditional generative models with neural radiance fields. Given widely available pairs of monocular images and label maps, our model learns to assign a label to every 3D point, in addition to color and density, which enables it to render an image and its pixel-aligned label map simultaneously. Finally, we build an interactive system that allows users to edit the label map from any viewpoint and generate outputs accordingly.
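To illustrate how a label can be rendered alongside color using the same volume-rendering weights, here is a hedged PyTorch sketch; the tensor shapes and names are assumptions rather than the model's actual code.

```python
import torch

def composite_color_and_labels(densities, colors, label_logits, deltas):
    """Volume-render color and a per-point label distribution with shared weights.

    densities:    (R, S) per-sample density along each of R rays
    colors:       (R, S, 3) per-sample RGB
    label_logits: (R, S, C) per-sample logits over C label classes
    deltas:       (R, S) distances between consecutive samples along each ray
    """
    alphas = 1.0 - torch.exp(-densities * deltas)                        # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones_like(alphas[:, :1]),
                                     1.0 - alphas + 1e-10], dim=-1),
                          dim=-1)[:, :-1]                                # accumulated transmittance
    weights = alphas * trans                                             # standard NeRF weights
    rgb = (weights[..., None] * colors).sum(dim=1)                       # (R, 3) rendered color
    labels = (weights[..., None] * label_logits.softmax(-1)).sum(dim=1)  # (R, C) pixel-aligned labels
    return rgb, labels
```

Because the color image and the label map share one set of weights, the rendered labels stay pixel-aligned with the rendered image from any viewpoint, which is what makes viewpoint-consistent label editing possible.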