CDLM: Consistency Diffusion Language Models For Faster Sampling

Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun, Ce Zhang, Kurt Keutzer, Amir Gholami

arXiv.org Artificial Intelligence 

Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference, both because they require many refinement steps and because they cannot use standard KV caching. We introduce CDLM (Consistency Diffusion Language Models), a training-based acceleration method that tackles both bottlenecks at once. CDLM integrates consistency modeling to drastically reduce the number of required sampling steps by enabling multi-token finalization. Furthermore, we enforce a block-wise causal attention mask during fine-tuning, making the model fully compatible with KV caching. Experiments show CDLM achieves 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks. The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.
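To illustrate the block-wise causal attention idea described in the abstract, here is a minimal sketch (not the paper's actual implementation) of how such a mask could be constructed in PyTorch: positions attend bidirectionally within their own block but only causally to earlier blocks, so the keys and values of finalized blocks never depend on future tokens and can be cached. The function name `block_causal_mask` and the `block_size` parameter are illustrative assumptions.

```python
import torch


def block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean attention mask of shape (seq_len, seq_len).

    mask[i, j] is True if position i may attend to position j, i.e. if
    j lies in the same block as i (bidirectional within a block) or in
    an earlier block (causal across blocks). Illustrative sketch only.
    """
    block_ids = torch.arange(seq_len) // block_size      # block index of each position
    mask = block_ids[:, None] >= block_ids[None, :]      # attend to same or earlier blocks
    return mask


# Example: 6 tokens, blocks of 2. Tokens 0-1 see only block 0; tokens 2-3 see
# blocks 0-1; tokens 4-5 see all blocks. Earlier blocks never attend forward,
# so their KV entries can be reused across refinement steps.
print(block_causal_mask(seq_len=6, block_size=2).int())
```

Because completed blocks only ever attend to themselves and to earlier blocks, their key/value activations stay fixed once the block is finalized, which is what makes standard KV caching applicable under this assumed masking scheme.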