mdm
Theoretical Benefit and Limitation of Diffusion Language Model
Diffusion language models have emerged as a new approach for text generation. By enabling the parallel sampling of multiple tokens in each diffusion step, they appear to offer a more efficient alternative to auto-regressive models. However, our observations show that current open-source diffusion language models require more sampling steps to achieve comparable accuracy on representative tasks--resulting in even higher inference costs than their auto-regressive counterparts. To investigate whether this is an inherent limitation, we conduct a rigorous theoretical analysis of a widely adopted variant: the Masked Diffusion Model (MDM). Surprisingly, our analysis reveals that the conclusion is highly sensitive to the choice of evaluation metric. Under mild conditions, we prove that when the target is near-optimal perplexity, MDMs can achieve this goal in a constant number of sampling steps, independent of sequence length. This result demonstrates that efficiency can, in principle, be attained without compromising generation quality. However, when targeting low sequence error rate--which is important for assessing the "correctness" of a generated sequence, such as a reasoning chain--we show that in the worst case, the required sampling steps must scale linearly with sequence length, thereby eliminating the efficiency advantage. Our analysis establishes the first theoretical foundation for understanding the comparative strengths and limitations of MDMs, offering practical guidance on when to favor MDMs over auto-regressive models and vice versa.
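To give a feel for why sequence error rate is a much stricter target than a per-token metric, the following minimal sketch assumes that token errors are independent with a fixed per-token error probability. That independence assumption and the function names are illustrative only; this is not the paper's proof or its worst-case construction.

```python
def sequence_error_rate(token_error: float, length: int) -> float:
    """Probability that at least one of `length` tokens is wrong,
    assuming independent per-token errors (illustrative assumption)."""
    return 1.0 - (1.0 - token_error) ** length


def max_token_error(target_ser: float, length: int) -> float:
    """Largest per-token error that keeps the sequence error rate
    at or below `target_ser` under the same independence assumption."""
    return 1.0 - (1.0 - target_ser) ** (1.0 / length)


for length in (16, 128, 1024):
    ser = sequence_error_rate(0.01, length)
    budget = max_token_error(0.1, length)
    print(f"L={length:5d}  SER at 1% token error={ser:.3f}  "
          f"token-error budget for 10% SER={budget:.5f}")
```

Under this toy model the per-token error budget shrinks roughly in proportion to 1/L, which is consistent in spirit with the abstract's claim that sequence-level correctness becomes harder to preserve as sequences grow.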
Accelerated Sampling from Masked Diffusion Models via Entropy Bounded Unmasking
Recent masked diffusion models (MDMs) have shown competitive performance compared to autoregressive models (ARMs) for language modeling. While most literature has focused on performance-enhancing sampling procedures, efficient sampling from MDMs has been scarcely explored. We observe that a given sequence of partially masked tokens often determines the values of multiple unknown tokens deterministically, meaning that a single prediction of a masked model holds additional information unused by standard sampling procedures.
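The observation above can be illustrated with a small sketch that unmasks several positions in one step when the model's predictions for them are nearly deterministic. The selection rule here (a hypothetical per-position entropy threshold `max_entropy`, with a fallback to the single most confident position) is an assumption made for illustration and is not claimed to be the paper's exact procedure.

```python
import math
from typing import Dict, List


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one predicted token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_positions(predictions: Dict[int, List[float]],
                     max_entropy: float) -> List[int]:
    """Choose which masked positions to unmask in one step.

    Hypothetical rule: take every position whose predictive entropy is
    at most `max_entropy`; if none qualifies, take the single
    lowest-entropy position so that sampling always makes progress.
    """
    ents = {pos: entropy(p) for pos, p in predictions.items()}
    chosen = sorted(pos for pos, h in ents.items() if h <= max_entropy)
    if not chosen:
        chosen = [min(ents, key=ents.get)]
    return chosen


# Toy predictions for three masked positions over a 3-token vocabulary.
predictions = {
    2: [0.99, 0.005, 0.005],   # nearly deterministic
    5: [0.98, 0.01, 0.01],     # nearly deterministic
    7: [0.4, 0.3, 0.3],        # still uncertain
}
print(select_positions(predictions, max_entropy=0.2))  # [2, 5]
```

With a looser threshold more tokens are committed per model call, trading fewer steps for a higher risk of committing to inconsistent tokens.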
Masked Diffusion Models as Energy Minimization
Chen, Sitong, Nie, Shen, Sun, Jiacheng, Feng, Zijin, Li, Zhenguo, Wen, Ji-Rong, Li, Chongxuan
We present a systematic theoretical framework that interprets masked diffusion models (MDMs) as solutions to energy minimization problems in discrete optimal transport. Specifically, we prove that three distinct energy formulations--kinetic, conditional kinetic, and geodesic energy--are mathematically equivalent under the structure of MDMs, and that MDMs minimize all three when the mask schedule satisfies a closed-form optimality condition. This unification not only clarifies the theoretical foundations of MDMs, but also motivates practical improvements in sampling. By parameterizing interpolation schedules via Beta distributions, we reduce the schedule design space to a tractable 2D search, enabling efficient post-training tuning without model modification. Experiments on synthetic and real-world benchmarks demonstrate that our energy-inspired schedules outperform hand-crafted baselines, particularly in low-step sampling settings.
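As a rough illustration of a two-parameter schedule family, the sketch below assumes that the fraction of tokens unmasked by time t is the CDF of a Beta(a, b) distribution, and converts that into a per-step unmasking budget. The exact way the paper maps Beta distributions to schedules may differ; the function names, the midpoint-rule integration, and the candidate grid are all assumptions.

```python
from typing import List


def beta_cdf_grid(a: float, b: float, steps: int,
                  resolution: int = 2000) -> List[float]:
    """Beta(a, b) CDF at t = k/steps for k = 0..steps.

    Uses a midpoint rule on the unnormalized density and normalizes by
    the total, so no special functions are needed. Accuracy degrades
    when a < 1 or b < 1 because the density is then unbounded.
    """
    dx = 1.0 / (steps * resolution)
    cumulative = [0.0]
    total = 0.0
    for k in range(steps):
        for j in range(resolution):
            x = (k * resolution + j + 0.5) * dx
            total += x ** (a - 1.0) * (1.0 - x) ** (b - 1.0) * dx
        cumulative.append(total)
    return [c / total for c in cumulative]


def tokens_per_step(a: float, b: float, seq_len: int,
                    steps: int) -> List[int]:
    """Number of tokens to unmask at each step under the assumed schedule."""
    targets = [round(seq_len * c) for c in beta_cdf_grid(a, b, steps)]
    return [targets[k + 1] - targets[k] for k in range(steps)]


# A small 2D grid over (a, b); each pair yields one candidate schedule.
for a, b in [(1.0, 1.0), (2.0, 5.0), (5.0, 2.0)]:
    print((a, b), tokens_per_step(a, b, seq_len=64, steps=8))
```

Because only two numbers define each schedule, candidates can be compared after training by running the sampler with each (a, b) pair and scoring the outputs, without touching the model weights.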
Masked Diffusion Models are Secretly Learned-Order Autoregressive Models
Garg, Prateek, Kohli, Bhavya, Sarawagi, Sunita
Masked Diffusion Models (MDMs) have emerged as one of the most promising paradigms for generative modeling over discrete domains. It is known that MDMs effectively train to decode tokens in a random order, and that this ordering has significant performance implications in practice. This observation raises a fundamental question: can we design a training framework that optimizes for a favorable decoding order? We answer this in the affirmative, showing that the continuous-time variational objective of MDMs, when equipped with multivariate noise schedules, can identify and optimize for a decoding order during training. We establish a direct correspondence between decoding order and the multivariate noise schedule and show that this setting breaks invariance of the MDM objective to the noise schedule. Furthermore, we prove that the MDM objective decomposes precisely into weighted auto-regressive losses over these orders, which establishes them as auto-regressive models with learnable orders.
Bringing Stability to Diffusion: Decomposing and Reducing Variance of Training Masked Diffusion Models
Jia, Mengni, Zhou, Mengyu, Liu, Yihao, Jiang, Xiaoxi, Jiang, Guanjun
Masked diffusion models (MDMs) are a promising alternative to autoregressive models (ARMs), but they suffer from inherently much higher training variance. High variance leads to noisier gradient estimates and unstable optimization, so even equally strong pretrained MDMs and ARMs that are competitive at initialization often diverge after task-specific training, with MDMs falling far behind. There has been no theoretical explanation or systematic solution. We derive the first decomposition of MDM training variance into three sources: (A) masking pattern noise, (B) masking rate noise, and (C) data noise, while ARMs are only affected by (C). This explains the fundamental training gap. Building on this foundation, we design six variance-reduction methods, including two core methods: (1) P-POTS, a Pareto-optimal t sampler that minimizes training variance by sampling harder t values more often with appropriately smaller update steps, and (2) MIRROR, which uses negatively correlated samples to reduce (A). Experiments show that compared to standard MDM training, our methods improve accuracy by 7-8% on complex reasoning tasks, while simultaneously reducing run-to-run variability to near ARM levels, substantially narrowing the gap with strong ARM baselines; in most settings, even the best baseline runs remain below the worst run of our method.
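The idea of using negatively correlated samples to reduce masking pattern noise can be illustrated with classic antithetic sampling. The sketch below is a toy under strong assumptions: per-token losses are fixed numbers rather than model outputs, and the "mirrored" mask is built by reflecting each uniform draw (u becomes 1 - u). It is not claimed to be how MIRROR itself constructs its samples.

```python
import random
import statistics
from typing import List


def masked_loss(uniforms: List[float], token_losses: List[float],
                rate: float) -> float:
    """Toy estimator: sum the losses of masked tokens (u < rate),
    reweighted by 1/rate so its expectation is the total loss."""
    return sum(l for u, l in zip(uniforms, token_losses) if u < rate) / rate


def pair_estimates(token_losses: List[float], rate: float, trials: int,
                   mirrored: bool, seed: int = 0) -> List[float]:
    """Average the estimator over two masks per trial, drawn either
    independently or as an antithetic (mirrored) pair."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        u1 = [rng.random() for _ in token_losses]
        if mirrored:
            u2 = [1.0 - u for u in u1]  # negatively correlated mask
        else:
            u2 = [rng.random() for _ in token_losses]
        estimates.append(0.5 * (masked_loss(u1, token_losses, rate)
                                + masked_loss(u2, token_losses, rate)))
    return estimates


token_losses = [0.5, 1.0, 2.0, 0.2, 1.5, 0.8, 3.0, 0.1]
independent = pair_estimates(token_losses, rate=0.3, trials=5000, mirrored=False)
mirrored = pair_estimates(token_losses, rate=0.3, trials=5000, mirrored=True)
print("independent pair variance:", statistics.variance(independent))
print("mirrored pair variance:   ", statistics.variance(mirrored))
```

Both variants estimate the same quantity, but in the mirrored pair a token masked in one sample tends to be unmasked in the other, so the two errors partly cancel.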
On Powerful Ways to Generate: Autoregression, Diffusion, and Beyond
Yang, Chenxiao, Zhou, Cai, Wipf, David, Li, Zhiyuan
Diffusion language models have recently emerged as a competitive alternative to autoregressive language models. Beyond next-token generation, they are more efficient and flexible by enabling parallel and any-order token generation. However, despite empirical successes, their computational power and fundamental limitations remain poorly understood. In this paper, we formally study whether non-autoregressive generation in Masked Diffusion Models (MDM) enables solving problems beyond the reach of Auto-Regressive Models (ARM). Our results show that MDM with sufficiently large context length is computationally universal, with decoding steps matching the optimal parallel time complexity in PRAM. However, when controlling for other factors, MDM's flexibility to generate in any order does not expand what ARM can already solve. To address this, we propose a new form of generation called any-process generation, which extends MDM with capabilities to remask, insert, and delete tokens, allowing self-correction, length-variable editing, and adaptive parallelism. Theoretically and empirically, we demonstrate that these capabilities enable scalability to significantly harder reasoning problems that are otherwise intractable for ARM and vanilla MDM. Additionally, they prove essential for generation tasks where objects naturally evolve through non-sequential processes, which is crucial for extending current LLMs beyond natural language to domains such as coding and science.