MC-DiT: Contextual Enhancement via Clean-to-Clean Reconstruction for Masked Diffusion Models

May-27-2025, 12:59:27 GMT–Neural Information Processing Systems

The Diffusion Transformer (DiT) is emerging as a leading architecture among generative diffusion models for image generation. Masked-reconstruction strategies have recently been used to improve the efficiency and semantic consistency of DiT training, but they extract contextual information poorly. In this paper, we show that noisy-to-noisy masked reconstruction prevents contextual information from being fully used. We support this insight with a theoretical analysis and an empirical study of the mutual information between unmasked and masked patches. Guided by this insight, we propose a novel training paradigm, MC-DiT, that fully learns contextual information through diffusion denoising at different noise variances with clean-to-clean masked reconstruction.
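
The contrast between the two reconstruction schemes named in the abstract can be made concrete with a toy sketch. It is not the paper's implementation: the abstract gives no architecture or loss details, so the helper names (`patchify`, `random_mask`, `reconstruction_pairs`), the additive Gaussian noising, and the 50% mask ratio are all illustrative assumptions. The sketch only shows which patches would serve as input and as target under one plausible reading of "noisy-to-noisy" and "clean-to-clean".

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W) image into flattened, non-overlapping patches."""
    h, w = image.shape
    rows, cols = h // patch, w // patch
    blocks = image[:rows * patch, :cols * patch].reshape(rows, patch, cols, patch)
    return blocks.transpose(0, 2, 1, 3).reshape(rows * cols, patch * patch)

def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask over patches; True marks a masked (hidden) patch."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def reconstruction_pairs(clean, mask, sigma, rng):
    """Build (input, target) patch sets for both schemes.

    Assumption: noising is plain additive Gaussian noise with standard
    deviation sigma; the paper's actual noise schedule is not given here.
    """
    noisy = clean + sigma * rng.standard_normal(clean.shape)
    return {
        # Noisy unmasked patches are used to predict noisy masked patches.
        "noisy_to_noisy": (noisy[~mask], noisy[mask]),
        # Clean unmasked patches are used to predict clean masked patches.
        "clean_to_clean": (clean[~mask], clean[mask]),
    }

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))      # hypothetical toy "image"
patches = patchify(image, patch=4)       # 4 patches of 16 values each
mask = random_mask(len(patches), mask_ratio=0.5, rng=rng)
pairs = reconstruction_pairs(patches, mask, sigma=1.0, rng=rng)
```

In the clean-to-clean pair, both context and target are free of noise, so any dependence between them reflects image content alone. In the noisy-to-noisy pair, independent noise is added to both sides, which is the setting the abstract argues weakens the use of contextual information.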

clean-to-clean reconstruction, contextual enhancement, masked diffusion model, (4 more...)

Technology:
- Information Technology > Artificial Intelligence > Machine Learning (1.00)