Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers

Neural Information Processing Systems 

Current generative depth estimation models fine-tune Stable Diffusion and achieve impressive performance. However, they require a VAE to compress depth maps into the latent space, which inevitably introduces flying pixels at edges and details.