\textit{Bifröst}: 3D-Aware Image Compositing with Language Instructions

Neural Information Processing Systems

This paper introduces \textit{Bifröst}, a novel 3D-aware framework built upon diffusion models to perform instruction-based image composition. Previous methods concentrate on image compositing at the 2D level, which falls short in handling complex spatial relationships (\textit{e.g.}, occlusion). Our method begins by fine-tuning a multimodal large language model (MLLM) on a custom counterfactual dataset to predict 2.5D object locations in complex backgrounds from language instructions. The image-compositing model is then designed to process multiple types of input features, enabling it to perform high-fidelity image compositions that account for occlusion, depth blur, and image harmonization. Extensive qualitative and quantitative evaluations demonstrate that \textit{Bifröst} significantly outperforms existing methods, providing a robust solution for generating realistically composited images in scenarios that demand intricate spatial understanding.