Accelerating Parallel Diffusion Model Serving with Residual Compression

Jun-14-2026, 03:13:09 GMT–Neural Information Processing Systems

Diffusion models produce realistic images and videos but require substantial computation, so real-time deployment relies on parallelism across multiple accelerators. Parallel inference, however, incurs significant communication overhead because devices must exchange large activations, which limits both efficiency and scalability. We present CompactFusion, a compression framework that substantially reduces this communication while preserving generation quality. Our key observation is that diffusion activations exhibit strong temporal redundancy: adjacent denoising steps produce highly similar activations, so bandwidth is saturated with near-duplicate data that carries little new information. To eliminate this waste, we seek a more compact representation that encodes only the essential information.
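To make the temporal-redundancy argument concrete, here is a minimal sketch. It assumes that each end of a link keeps the previously reconstructed activation as a shared reference and that, at each step, only the quantized difference (the residual) is transmitted. The symmetric 8-bit quantizer and the names `quantize`, `ResidualSender`, and `ResidualReceiver` are hypothetical choices for illustration. This is not CompactFusion's actual compressor or API.

```python
"""Illustrative sketch of step-to-step residual compression (hypothetical API).

Assumption: a simple symmetric 8-bit quantizer stands in for whatever
compressor CompactFusion actually uses; all names here are made up.
"""
from typing import List, Tuple


def quantize(values: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Symmetric uniform quantization with a single per-tensor scale."""
    qmax = 2 ** (bits - 1) - 1
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    return [c * scale for c in codes]


class ResidualSender:
    """Transmits only the quantized change relative to a shared reference."""

    def __init__(self, size: int, bits: int = 8) -> None:
        self.bits = bits
        self.reference = [0.0] * size  # mirrors the receiver's reconstruction

    def encode(self, activation: List[float]) -> Tuple[List[int], float]:
        residual = [a - r for a, r in zip(activation, self.reference)]
        codes, scale = quantize(residual, self.bits)
        # Advance by the *dequantized* residual, exactly as the receiver will,
        # so this step's quantization error is folded into the next residual
        # instead of accumulating as drift.
        self.reference = [r + d for r, d in zip(self.reference, dequantize(codes, scale))]
        return codes, scale


class ResidualReceiver:
    """Rebuilds each step's activation from the previous reconstruction."""

    def __init__(self, size: int) -> None:
        self.reference = [0.0] * size

    def decode(self, codes: List[int], scale: float) -> List[float]:
        self.reference = [r + d for r, d in zip(self.reference, dequantize(codes, scale))]
        return list(self.reference)


# Three adjacent "steps" whose activations differ only slightly.
steps = [
    [1.00, -2.00, 0.50, 3.00],
    [1.01, -1.98, 0.52, 3.03],
    [1.02, -1.97, 0.55, 3.05],
]
tx, rx = ResidualSender(size=4), ResidualReceiver(size=4)
for step, activation in enumerate(steps):
    codes, scale = tx.encode(activation)
    restored = rx.decode(codes, scale)
    print(step, codes, f"scale={scale:.2e}")
```

The first step effectively sends the whole activation. Every later step sends a residual whose range is far smaller than the raw values, so the same 8-bit codes resolve much finer detail. Having the sender advance its reference with the dequantized residual, rather than the true activation, keeps both ends in lockstep. This trades a small per-step error for freedom from accumulated drift.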

artificial intelligence, machine learning, proceedings
