DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling
–Neural Information Processing Systems
Diffusion Transformer (DiT), a promising diffusion model for visual generation, demonstrates impressive performance but incurs significant computational overhead. Intriguingly, analysis of pre-trained DiT models reveals that global self-attention is often redundant and predominantly captures local patterns, which highlights the potential for more efficient alternatives. In this paper, we revisit convolution as an alternative building block for constructing efficient and expressive diffusion models. However, naively replacing self-attention with convolution typically degrades performance. Our investigations attribute this performance gap to the higher channel redundancy in ConvNets compared to Transformers. To resolve this, we introduce a compact channel attention mechanism that promotes the activation of more diverse channels, thereby enhancing feature diversity.
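The abstract does not spell out how the compact channel attention is built, so the following is only a minimal sketch of a generic channel gate, not the authors' implementation. It assumes a common design: global average pooling per channel, a pointwise (1x1) projection, and a sigmoid that rescales each channel. The function name `channel_attention` and the parameters `weight` and `bias` are hypothetical, chosen for illustration.

```python
import math


def channel_attention(x, weight, bias):
    """Generic channel gate (an assumed design, not taken from the paper).

    x:      list of C channels, each an H x W nested list of floats
    weight: C x C matrix acting as a 1x1 convolution on pooled features
    bias:   list of C floats
    Returns the rescaled feature map and the per-channel gates.
    """
    # Global average pooling: reduce each channel to one scalar.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]

    # Pointwise projection of the pooled vector, followed by a sigmoid.
    gates = []
    for w_row, b in zip(weight, bias):
        z = sum(w * p for w, p in zip(w_row, pooled)) + b
        gates.append(1.0 / (1.0 + math.exp(-z)))

    # Rescale every value in a channel by that channel's gate.
    out = [[[v * g for v in row] for row in ch] for ch, g in zip(x, gates)]
    return out, gates


# Usage: two 2x2 channels with a zero-initialised projection,
# so every gate equals sigmoid(0) = 0.5.
features = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
zero_w = [[0.0, 0.0], [0.0, 0.0]]
zero_b = [0.0, 0.0]
scaled, gates = channel_attention(features, zero_w, zero_b)
```

With learned weights, the gates differ per channel, which is how a mechanism of this kind can amplify underused channels and suppress redundant ones at a cost that is small relative to self-attention.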
Jun-22-2026, 01:06:54 GMT