Bi, Sai
EP-CFG: Energy-Preserving Classifier-Free Guidance
Zhang, Kai, Luan, Fujun, Bi, Sai, Zhang, Jianming
Classifier-free guidance (CFG) (Ho & Salimans, 2022) is widely used in diffusion models (Ho et al., 2020; Song et al., 2020) for text-guided generation, but often leads to over-contrast and over-saturation artifacts. We propose EP-CFG, a simple yet effective CFG solution that preserves the energy distribution of the conditional prediction while maintaining strong semantic alignment. In modern text-to-image models, the CFG strength is typically set around 7-10 (Rombach et al., 2022) to sample high-quality visuals. However, it is well known that such high CFG strength leads to over-contrast and over-saturation artifacts (Ho & Salimans, 2022). Concurrent work (Sadat et al., 2024) proposed APG, which addresses over-saturation through a decomposition of the update term.
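A minimal PyTorch sketch of how an energy-preserving rescaling could sit on top of the standard CFG update, assuming "energy" is measured as the per-sample squared L2 norm of the conditional prediction; the function name and arguments are illustrative, not the authors' released implementation.

```python
import torch

def ep_cfg(eps_cond: torch.Tensor, eps_uncond: torch.Tensor,
           scale: float = 7.5, eps: float = 1e-8) -> torch.Tensor:
    """Classifier-free guidance with energy-preserving rescaling (illustrative sketch).

    Standard CFG extrapolates the conditional prediction; here the guided output
    is rescaled per sample so its L2 energy matches that of the conditional
    prediction, one way to counteract over-contrast and over-saturation.
    """
    # Standard CFG update (Ho & Salimans, 2022).
    guided = eps_uncond + scale * (eps_cond - eps_uncond)

    # Per-sample energies, computed over all non-batch dimensions.
    dims = tuple(range(1, guided.ndim))
    e_cond = eps_cond.pow(2).sum(dim=dims, keepdim=True)
    e_guided = guided.pow(2).sum(dim=dims, keepdim=True)

    # Rescale the guided prediction to the conditional prediction's energy.
    return guided * torch.sqrt(e_cond / (e_guided + eps))
```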
Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors
Kuang, Zhengfei, Zhang, Tianyuan, Zhang, Kai, Tan, Hao, Bi, Sai, Hu, Yiwei, Xu, Zexiang, Hasan, Milos, Wetzstein, Gordon, Luan, Fujun
We present Buffer Anytime, a framework for estimating depth and normal maps (which we call geometric buffers) from video that eliminates the need for paired video-depth and video-normal training data. Instead of relying on large-scale annotated video datasets, we demonstrate high-quality video buffer estimation by leveraging single-image priors with temporal consistency constraints. Our zero-shot training strategy combines supervision from state-of-the-art image estimation models with an optical-flow-based temporal smoothness term in a hybrid loss function, implemented via a lightweight temporal attention architecture. Applied to leading image models such as Depth Anything V2 and Marigold-E2E-FT, our approach significantly improves temporal consistency while maintaining accuracy. Experiments show that our method not only outperforms image-based approaches but also achieves results comparable to state-of-the-art video models trained on large-scale paired video datasets, despite using no such paired video data.
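A minimal sketch of the kind of hybrid objective described above, assuming precomputed forward optical flow, a frozen image model providing per-frame pseudo-labels, and L1 penalties; the names, weighting, and the omission of occlusion handling are illustrative assumptions, not the released training code.

```python
import torch
import torch.nn.functional as F

def hybrid_buffer_loss(pred: torch.Tensor, teacher: torch.Tensor,
                       flow: torch.Tensor, lambda_t: float = 1.0) -> torch.Tensor:
    """Illustrative hybrid loss: per-frame image-prior supervision plus
    flow-based temporal smoothness (occlusion masking omitted for brevity).

    pred, teacher: (T, C, H, W) predicted buffers / frozen image-model buffers.
    flow: (T-1, 2, H, W) optical flow from frame t to frame t+1 (assumed precomputed).
    """
    # Per-frame term: stay close to the single-image prior's prediction.
    loss_prior = F.l1_loss(pred, teacher)

    # Temporal term: predictions warped along the flow should agree across frames.
    T, _, H, W = pred.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=flow.device),
                            torch.arange(W, device=flow.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float()            # (2, H, W), x then y
    grid = base.unsqueeze(0) + flow                        # (T-1, 2, H, W)
    grid[:, 0] = 2.0 * grid[:, 0] / (W - 1) - 1.0          # normalize to [-1, 1]
    grid[:, 1] = 2.0 * grid[:, 1] / (H - 1) - 1.0
    warped_next = F.grid_sample(pred[1:], grid.permute(0, 2, 3, 1),
                                align_corners=True)
    loss_temporal = F.l1_loss(warped_next, pred[:-1])

    return loss_prior + lambda_t * loss_temporal
```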
LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias
Jin, Haian, Jiang, Hanwen, Tan, Hao, Zhang, Kai, Bi, Sai, Zhang, Tianyuan, Luan, Fujun, Snavely, Noah, Xu, Zexiang
We propose the Large View Synthesis Model (LVSM), a novel transformer-based approach for scalable and generalizable novel view synthesis from sparse-view inputs. We introduce two architectures: (1) an encoder-decoder LVSM, which encodes input image tokens into a fixed number of 1D latent tokens, functioning as a fully learned scene representation, and decodes novel-view images from them; and (2) a decoder-only LVSM, which directly maps input images to novel-view outputs, completely eliminating intermediate scene representations. Both models bypass the 3D inductive biases used in previous methods, from 3D representations (e.g., NeRF, 3DGS) to network designs (e.g., epipolar projections, plane sweeps), addressing novel view synthesis with a fully data-driven approach. While the encoder-decoder model offers faster inference due to its independent latent representation, the decoder-only LVSM achieves superior quality, scalability, and zero-shot generalization, outperforming previous state-of-the-art methods by 1.5 to 3.5 dB PSNR. Comprehensive evaluations across multiple datasets demonstrate that both LVSM variants achieve state-of-the-art novel view synthesis quality. Notably, our models surpass all previous methods even with reduced computational resources (1-2 GPUs). Novel view synthesis is a long-standing challenge in vision and graphics. For decades, the community has generally relied on various 3D inductive biases, incorporating 3D priors and handcrafted structures to simplify the task and improve synthesis quality. Beyond per-scene optimization, other methods have built generalizable networks to estimate such representations or to directly generate novel-view images in a feed-forward manner, often incorporating additional 3D inductive biases, such as projective epipolar lines or plane-sweep volumes, in their architecture designs (Wang et al., 2021a; Yu et al., 2021; Chen et al., 2021; Suhail et al., 2022b; Charatan et al., 2024; Chen et al., 2024). While effective, these 3D inductive biases inherently limit model flexibility, constraining their adaptability to more diverse and complex scenarios that do not align with predefined priors or handcrafted structures.
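A compact sketch in the spirit of the decoder-only variant: input-view patches (RGB plus a per-pixel ray embedding) and target-view ray patches are flattened into one token sequence, and a plain transformer regresses the target-view patches directly, with no intermediate 3D representation. Patch size, the 6-D Plücker-ray encoding, and all module names are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DecoderOnlyViewSynth(nn.Module):
    """Illustrative decoder-only view-synthesis transformer (not the authors' code)."""

    def __init__(self, patch: int = 8, dim: int = 512, depth: int = 12, heads: int = 8):
        super().__init__()
        self.patch = patch
        in_ch = 3 + 6   # RGB + assumed 6-D Plücker ray per pixel
        tgt_ch = 6      # target views provide rays only
        self.embed_in = nn.Linear(in_ch * patch * patch, dim)
        self.embed_tgt = nn.Linear(tgt_ch * patch * patch, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)   # positional/view embeddings omitted
        self.to_rgb = nn.Linear(dim, 3 * patch * patch)

    def _patchify(self, x: torch.Tensor) -> torch.Tensor:
        b, v, c, h, w = x.shape
        p = self.patch
        x = x.reshape(b, v, c, h // p, p, w // p, p)
        return x.permute(0, 1, 3, 5, 2, 4, 6).reshape(b, v * (h // p) * (w // p), -1)

    def forward(self, src_rgb_rays: torch.Tensor, tgt_rays: torch.Tensor) -> torch.Tensor:
        # src_rgb_rays: (B, V_in, 9, H, W); tgt_rays: (B, V_tgt, 6, H, W)
        src_tok = self.embed_in(self._patchify(src_rgb_rays))
        tgt_tok = self.embed_tgt(self._patchify(tgt_rays))
        n_tgt = tgt_tok.shape[1]
        tokens = self.blocks(torch.cat([src_tok, tgt_tok], dim=1))
        # Predicted target-view RGB patches; unpatchify outside to form images.
        return self.to_rgb(tokens[:, -n_tgt:])
```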
RelitLRM: Generative Relightable Radiance for Large Reconstruction Models
Zhang, Tianyuan, Kuang, Zhengfei, Jin, Haian, Xu, Zexiang, Bi, Sai, Tan, Hao, Zhang, He, Hu, Yiwei, Hasan, Milos, Freeman, William T., Zhang, Kai, Luan, Fujun
We propose RelitLRM, a Large Reconstruction Model (LRM) for generating high-quality Gaussian splatting representations of 3D objects under novel illuminations from sparse (4-8) posed images captured under unknown static lighting. Unlike prior inverse rendering methods that require dense captures and slow optimization, often causing artifacts such as incorrect highlights or baked-in shadows, RelitLRM adopts a feed-forward transformer-based model with a novel combination of a geometry reconstructor and a relightable appearance generator based on diffusion. The model is trained end-to-end on synthetic multi-view renderings of objects under varying known illuminations. This architecture design enables the model to effectively decompose geometry and appearance, resolve the ambiguity between material and lighting, and capture the multi-modal distribution of shadows and specularity in the relit appearance. We show that our sparse-view feed-forward RelitLRM offers relighting results competitive with state-of-the-art dense-view optimization-based baselines while being significantly faster. Our project page is available at: https://relit-lrm.github.io/.
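A very high-level sketch of the two-stage split described above, with the geometry reconstructor and the diffusion-based appearance generator left as abstract modules; the interfaces (`geometry_net(images, poses)`, `appearance_diffusion.sample(...)`) are hypothetical placeholders rather than the released architecture.

```python
import torch
import torch.nn as nn

class RelitLRMSketch(nn.Module):
    """Illustrative two-stage relightable reconstruction: feed-forward geometry,
    diffusion-sampled appearance (module interfaces are assumptions)."""

    def __init__(self, geometry_net: nn.Module, appearance_diffusion: nn.Module):
        super().__init__()
        self.geometry_net = geometry_net                  # posed images -> Gaussian geometry tokens
        self.appearance_diffusion = appearance_diffusion  # geometry + target light -> appearance

    @torch.no_grad()
    def relight(self, images, poses, target_env_map, num_steps: int = 30):
        # 1) Feed-forward geometry reconstruction from sparse posed images.
        geom_tokens = self.geometry_net(images, poses)
        # 2) Sample a relit appearance (per-Gaussian color attributes) conditioned on
        #    the geometry and the novel illumination; sampling lets the model express
        #    the multi-modal distribution of shadows and specularity.
        appearance = self.appearance_diffusion.sample(
            cond=(geom_tokens, target_env_map), steps=num_steps)
        return geom_tokens, appearance
```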
Neural Gaffer: Relighting Any Object via Diffusion
Jin, Haian, Li, Yuan, Luan, Fujun, Xiangli, Yuanbo, Bi, Sai, Zhang, Kai, Xu, Zexiang, Sun, Jin, Snavely, Noah
Single-image relighting is a challenging task that involves reasoning about the complex interplay between geometry, materials, and lighting. Many prior methods either support only specific categories of images, such as portraits, or require special capture conditions, like using a flashlight. Alternatively, some methods explicitly decompose a scene into intrinsic components, such as normals and BRDFs, which can be inaccurate or under-expressive. In this work, we propose a novel end-to-end 2D relighting diffusion model, called Neural Gaffer, that takes a single image of any object and can synthesize an accurate, high-quality relit image under any novel environmental lighting condition, simply by conditioning an image generator on a target environment map, without an explicit scene decomposition. Our method builds on a pre-trained diffusion model, and fine-tunes it on a synthetic relighting dataset, revealing and harnessing the inherent understanding of lighting present in the diffusion model. We evaluate our model on both synthetic and in-the-wild Internet imagery and demonstrate its advantages in terms of generalization and accuracy. Moreover, by combining with other generative methods, our model enables many downstream 2D tasks, such as text-based relighting and object insertion. Our model can also operate as a strong relighting prior for 3D tasks, such as relighting a radiance field.
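One plausible shape of such a fine-tuning step, sketched with hypothetical interfaces: the model is conditioned by channel-concatenating the encoded input image and an environment-map encoding with the noisy target latent, and trained with the usual noise-prediction loss. The helpers (`unet`, `vae_encode`, `add_noise`) and the exact conditioning layout are assumptions, not the paper's recipe.

```python
import torch
import torch.nn.functional as F

def relighting_finetune_step(unet, vae_encode, add_noise, x_relit, x_input, env_cond,
                             num_train_timesteps: int = 1000) -> torch.Tensor:
    """Illustrative fine-tuning step for an environment-map-conditioned relighting
    diffusion model; `unet`, `vae_encode`, and `add_noise` stand in for a
    pretrained latent diffusion stack."""
    with torch.no_grad():
        z_target = vae_encode(x_relit)   # latent of the ground-truth relit image
        z_source = vae_encode(x_input)   # latent of the input image (condition)
    noise = torch.randn_like(z_target)
    t = torch.randint(0, num_train_timesteps, (z_target.shape[0],), device=z_target.device)
    z_noisy = add_noise(z_target, noise, t)   # forward diffusion at timestep t
    # Condition by channel-concatenating the source latent and an encoding of the
    # target environment map with the noisy target latent.
    model_in = torch.cat([z_noisy, z_source, env_cond], dim=1)
    pred_noise = unet(model_in, t)
    return F.mse_loss(pred_noise, noise)
```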
Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning
Xie, Desai, Li, Jiahao, Tan, Hao, Sun, Xin, Shu, Zhixin, Zhou, Yi, Bi, Sai, Pirk, Sören, Kaufman, Arie E.
Multi-view diffusion models, obtained by applying Supervised Finetuning (SFT) to text-to-image diffusion models, have driven recent breakthroughs in text-to-3D research. However, due to the limited size and quality of existing 3D datasets, they still suffer from multi-view inconsistencies and Neural Radiance Field (NeRF) reconstruction artifacts. We argue that multi-view diffusion models can benefit from further Reinforcement Learning Finetuning (RLFT), which allows models to learn from data generated by themselves and improve beyond the limitations of their SFT datasets. To this end, we introduce Carve3D, an improved RLFT algorithm coupled with a novel Multi-view Reconstruction Consistency (MRC) metric, to enhance the consistency of multi-view diffusion models. To compute the MRC metric on a set of multi-view images, we compare them with their corresponding NeRF renderings at the same camera viewpoints. The resulting model, which we denote Carve3DM, demonstrates better multi-view consistency and NeRF reconstruction quality than existing models. Our results suggest that pairing SFT with Carve3D's RLFT is essential for developing multi-view-consistent diffusion models, mirroring the standard Large Language Model (LLM) alignment pipeline. Our code, training and testing data, and video results are available at: https://desaixie.github.io/carve-3d.
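A sketch of how an MRC-style reward could be computed, assuming helper routines that fit and render a NeRF from the generated views, and using LPIPS as the image similarity; `train_nerf` and `render_nerf` are hypothetical helpers, and the similarity choice is an assumption rather than the paper's exact protocol.

```python
import torch
import lpips  # perceptual similarity, pip install lpips

def mrc_reward(gen_images, cameras, train_nerf, render_nerf, perceptual=None):
    """Illustrative multi-view reconstruction consistency reward.

    gen_images: (V, 3, H, W) views produced by the multi-view diffusion model, in [0, 1].
    cameras:    per-view camera parameters matching gen_images.
    train_nerf / render_nerf: assumed helpers that fit a NeRF to the views and
    re-render it at a given camera (stand-ins for an off-the-shelf NeRF pipeline).
    """
    perceptual = perceptual or lpips.LPIPS(net="vgg").to(gen_images.device)
    nerf = train_nerf(gen_images, cameras)                         # fit a radiance field
    renders = torch.stack([render_nerf(nerf, cam) for cam in cameras])
    with torch.no_grad():
        # LPIPS expects inputs in [-1, 1]; lower distance = more consistent views.
        dist = perceptual(renders * 2 - 1, gen_images * 2 - 1).mean()
    return -dist  # higher reward = better multi-view consistency
```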
LRM: Large Reconstruction Model for Single Image to 3D
Hong, Yicong, Zhang, Kai, Gu, Jiuxiang, Bi, Sai, Zhou, Yang, Liu, Difan, Liu, Feng, Sunkavalli, Kalyan, Bui, Trung, Tan, Hao
We propose the first Large Reconstruction Model (LRM) that predicts the 3D model of an object from a single input image within just 5 seconds. In contrast to many previous methods that are trained on small-scale datasets such as ShapeNet in a category-specific fashion, LRM adopts a highly scalable transformer-based architecture with 500 million learnable parameters to directly predict a neural radiance field (NeRF) from the input image. We train our model in an end-to-end manner on massive multi-view data containing around 1 million objects, including both synthetic renderings from Objaverse and real captures from MVImgNet. This combination of a high-capacity model and large-scale training data empowers our model to be highly generalizable and produce high-quality 3D reconstructions from various testing inputs including real-world in-the-wild captures and images from generative models. Video demos and interactive 3D meshes can be found on this website: https://yiconghong.me/LRM/.
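A minimal sketch of the triplane-NeRF decoding stage that a transformer like LRM's could produce features for: point features are bilinearly sampled from three axis-aligned planes and decoded by a small MLP into color and density for volume rendering. Channel sizes, plane layout, and the MLP are illustrative assumptions, not the paper's exact decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneNeRFDecoder(nn.Module):
    """Minimal triplane-NeRF decoder of the kind an LRM-style transformer predicts."""

    def __init__(self, channels: int = 32, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * channels, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),                        # RGB + density
        )

    def forward(self, planes: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
        # planes: (3, C, H, W) triplane features; xyz: (N, 3) query points in [-1, 1]^3.
        feats = []
        for i, (a, b) in enumerate([(0, 1), (0, 2), (1, 2)]):   # xy, xz, yz planes
            coords = xyz[:, [a, b]].view(1, 1, -1, 2)           # (1, 1, N, 2)
            sampled = F.grid_sample(planes[i:i + 1], coords, align_corners=True)
            feats.append(sampled.view(-1, xyz.shape[0]).t())    # (N, C)
        rgb_sigma = self.mlp(torch.cat(feats, dim=-1))
        rgb = torch.sigmoid(rgb_sigma[:, :3])
        sigma = F.softplus(rgb_sigma[:, 3:])
        return torch.cat([rgb, sigma], dim=-1)                  # (N, 4) for volume rendering
```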