Scaling Properties of Diffusion Models for Perceptual Tasks

Rahul Ravishankar, Zeeshan Patel, Jathushan Rajasegaran, Jitendra Malik

arXiv.org Artificial Intelligence 

In this paper, we argue that iterative computation with diffusion models offers a powerful paradigm not only for generation but also for visual perception tasks. We unify tasks such as depth estimation, optical flow, and amodal segmentation under the framework of image-to-image translation, and show how diffusion models benefit from scaling training and test-time compute for these perceptual tasks. Through a careful analysis of these scaling properties, we formulate compute-optimal training and inference recipes to scale diffusion models for visual perception tasks. Our models achieve performance competitive with state-of-the-art methods while using significantly less data and compute.

Diffusion models have emerged as powerful techniques for generating images and videos, while exhibiting excellent scaling behavior. In this paper, we present a unified framework to perform a variety of perceptual tasks -- depth estimation, optical flow estimation, and amodal segmentation -- with a single diffusion model, as illustrated in Figure 1. Previous works such as Marigold (Ke et al., 2024), FlowDiffuser (Luo et al., 2024), and pix2gestalt (Ozguroglu et al., 2024) demonstrate the potential of repurposing image diffusion models for various inverse vision tasks individually.
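To make the image-to-image framing concrete, below is a minimal sketch of conditional diffusion sampling for a perceptual task: the input RGB image and a task identifier condition the denoiser, and the number of denoising steps serves as the test-time compute knob. The denoiser interface, DDIM-style deterministic update, and linear noise schedule are illustrative assumptions, not the paper's actual architecture, schedule, or training recipe.

```python
# Sketch only: a generic epsilon-prediction denoiser `model`, a task string,
# and a DDIM-style update are assumptions for illustration.
import numpy as np

def sample_perception_map(model, cond_image, task, num_steps=50,
                          shape=(1, 3, 64, 64), seed=0):
    """Conditional diffusion sampling for an image-to-image perceptual task.

    `num_steps` controls how much iterative (test-time) compute is spent:
    more denoising steps means more passes through the model.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)                       # start from Gaussian noise
    # Illustrative linear schedule of cumulative alphas, from noisy to clean.
    alpha_bars = np.linspace(1e-4, 0.9999, num_steps + 1)
    for i in range(num_steps):
        a_cur, a_next = alpha_bars[i], alpha_bars[i + 1]
        # The denoiser sees the noisy target map, the conditioning image,
        # and a task identifier (e.g. "depth", "flow", "amodal").
        eps = model(x, cond_image, task, t=num_steps - i)
        x0 = (x - np.sqrt(1.0 - a_cur) * eps) / np.sqrt(a_cur)   # estimated clean map
        x = np.sqrt(a_next) * x0 + np.sqrt(1.0 - a_next) * eps   # DDIM step toward clean
    return x  # predicted depth / flow / amodal mask in image space

# Usage with a stand-in denoiser (a trained diffusion model would go here).
def dummy_model(x, cond, task, t):
    return np.zeros_like(x)

depth_map = sample_perception_map(dummy_model, cond_image=None,
                                  task="depth", num_steps=20)
```

Under this framing, switching the task only changes the conditioning, while increasing `num_steps` trades extra inference compute for refined predictions, which is the test-time scaling axis the paper analyzes.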