DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
–Neural Information Processing Systems
This paper's primary objective is to develop a robust generalist perception model capable of addressing multiple tasks under constraints of computational resources and limited training data. We leverage text-to-image diffusion models pre-trained on billions of images and successfully introduce our DICEPTION, a visual generalist model. Exhaustive evaluations demonstrate that DICEPTION effectively tackles diverse perception tasks, even achieving performance comparable to SOTA single-task specialist models. Specifically, we achieve results on par with SAM-vit-h using only 0.06% of their data (e.g., 600K vs.\ 1B pixel-level annotated images). We designed comprehensive experiments on architectures and input paradigms, demonstrating that the key to successfully re-purposing a single diffusion model for multiple perception tasks lies in maximizing the preservation of the pre-trained model's prior knowledge. Consequently, DICEPTION can be trained with substantially lower computational costs than conventional models requiring training from scratch.
Neural Information Processing Systems
Jun-11-2026, 16:37:35 GMT
- Technology: