diffusion language model
Diffusion of Thought: Chain-of-Thought Reasoning in Diffusion Language Models
Recently, diffusion models have garnered significant interest in the field of text processing due to their many potential advantages compared to conventional autoregressive models. In this work, we propose Diffusion-of-Thought (DoT), a novel approach that integrates diffusion models with Chain-of-Thought, a well-established technique for improving the reasoning ability of autoregressive language models. In contrast to autoregressive language models that make decisions in a left-to-right, token-by-token manner, DoT allows reasoning steps to diffuse over time through a diffusion language model and offers greater flexibility in trading off computation for reasoning performance. Our experimental results demonstrate the effectiveness of DoT in multi-digit multiplication, boolean logic, and grade school math problems.
A Survey on Diffusion Language Models
Li, Tianyi, Chen, Mingda, Guo, Bowei, Shen, Zhiqiang
A different approach, Reparameterized Discrete diffusion Models (RDMs) [62], establishes an alternative formulation for the reverse process, which simplifies the training objective to a weighted cross-entropy loss. This enables more flexible and adaptive decoding strategies, leading to significant performance gains over previous discrete diffusion models. Similarly, MD4 [63] derives a simple weighted integral of cross-entropy losses as the continuous-time variational objective of masked diffusion models, providing a simple and generalized framework for training DLMs. Another analogous approach is MDLM [64], which introduces a simplified, Rao-Blackwellized objective that takes the form of a weighted average of masked language modeling losses. Diffusion-LLM [65] demonstrates the scalability of DLMs by adapting pre-trained masked language models to the diffusion paradigm, followed by task-specific finetuning and instruction finetuning, unlocking their versatility in solving general language tasks. Diffusion-NAT [66] unifies a discrete diffusion model with a PLM by reformulating the denoising process as a non-autoregressive masked token recovery task, allowing BART to act as an effective denoiser. Plaid [67] is the first diffusion language model trained to maximize data likelihood, demonstrating through scaling laws that it can outperform autoregressive models like GPT-2 on standard benchmarks. To improve the training objective, SEDD [68] introduces a score entropy loss to directly learn the ratios of the data distribution, which serves as a discrete extension of score matching. Reparameterized Absorbing Discrete Diffusion (RADD) [69] reveals that the concrete score in absorbing diffusion can be expressed as a time-independent conditional probability of the clean data, multiplied by an analytic, time-dependent scalar.
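The weighted cross-entropy objectives described for RDM, MD4, and MDLM can be illustrated with a small sketch. It assumes a linear masking schedule (each token is masked independently with probability t) and a loss weight of 1/t on the masked positions; the schedule, the `MASK` symbol, and the function names are illustrative assumptions, not the exact formulation of any of the cited papers.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask symbol, for illustration only


def corrupt(tokens, t, rng):
    """Mask each token independently with probability t (assumed linear schedule)."""
    mask = [rng.random() < t for _ in tokens]
    noisy = [MASK if m else tok for tok, m in zip(tokens, mask)]
    return noisy, mask


def masked_diffusion_loss(true_token_log_probs, mask, t):
    """Weighted masked cross-entropy for one sequence at noise level t.

    true_token_log_probs[i] is the model's log-probability of the clean token
    at position i given the corrupted sequence. Only masked positions
    contribute, and the sum is scaled by the assumed weight 1 / t.
    """
    if not 0.0 < t <= 1.0:
        raise ValueError("t must lie in (0, 1]")
    nll_on_masked = -sum(lp for lp, m in zip(true_token_log_probs, mask) if m)
    return nll_on_masked / t
```

In training, t would typically be drawn at random per sequence and the loss averaged over a batch, which approximates the integral over noise levels that the continuous-time objective describes.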
Decoding Large Language Diffusion Models with Foreseeing Movement
Mo, Yichuan, Chen, Quan, Li, Mingjie, Wei, Zeming, Wang, Yisen
Large Language Diffusion Models (LLDMs) benefit from a flexible decoding mechanism that enables parallelized inference and controllable generations over autoregressive models. Yet such flexibility introduces a critical challenge: inference performance becomes highly sensitive to the decoding order of tokens. Existing heuristic methods, however, focus mainly on local effects while overlooking long-term impacts. To address this limitation, we propose the Foreseeing Decoding Method (FDM), a novel approach that integrates both local and global considerations to unlock the full potential, employing a search-based strategy to enable effective optimization in discrete spaces. Furthermore, by analyzing the consistency of chosen tokens in the full decoding process, we develop a variant, FDM with Acceleration (FDM-A), which restricts deep exploration to critical steps identified as the exploration and balance circumstances. Extensive experiments across diverse benchmarks and model architectures validate the scalability of FDM and demonstrate the superior efficiency-performance trade-off achieved by FDM-A. Our work may provide a principled step toward more powerful decoding methods for LLDMs.
WavefrontDiffusion: Dynamic Decoding Schedule for Improved Reasoning
Yang, Haojin, Hu, Rui, Sun, Zequn, Zhou, Rui, Cai, Yujun, Wang, Yiwei
Diffusion Language Models (DLMs) have shown strong potential for text generation and are becoming a competitive alternative to autoregressive models. The denoising strategy plays an important role in determining the quality of their outputs. Mainstream denoising strategies include Standard Diffusion and BlockDiffusion. Standard Diffusion performs global denoising without restricting the update range, often finalizing incomplete context and causing premature end-of-sequence predictions. BlockDiffusion updates fixed-size blocks in a preset order, but its rigid structure can break apart coherent semantic units and disrupt reasoning. We present WavefrontDiffusion, a dynamic decoding approach that expands a wavefront of active tokens outward from finalized positions. This adaptive process follows the natural flow of semantic structure while keeping computational cost equal to block-based methods. Across four benchmarks in reasoning and code generation, WavefrontDiffusion achieves state-of-the-art performance while producing outputs with higher semantic fidelity, showing the value of adaptive scheduling for more coherent and efficient generation. Recent advances in large language models (LLMs) have achieved remarkable progress in complex reasoning and structured generation tasks such as mathematical problem solving and code synthesis (OpenAI et al., 2025; DeepSeek-AI et al., 2025). Autoregressive (AR) models remain the dominant paradigm for these tasks due to their stepwise logical consistency (Deletang et al., 2024). However, their strictly sequential nature introduces latency and limits flexibility, which can be problematic in settings that demand both accuracy and responsiveness, such as interactive assistants or real-time code generation. These limitations have motivated the exploration of alternative decoding paradigms that can balance quality, efficiency, and adaptability (Leviathan et al., 2023). 
Diffusion Language Models (DLMs) have recently emerged as a promising alternative by framing text generation as an iterative denoising process (Gong et al., 2025; Song et al., 2025).
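The idea of expanding a wavefront of active tokens outward from finalized positions, as described in the WavefrontDiffusion abstract, can be sketched as a simple neighborhood computation. The `radius` parameter and the function name are hypothetical; the abstract does not specify how the wavefront is sized or scored, so this only illustrates the scheduling idea.

```python
def wavefront(finalized, seq_len, radius=1):
    """Return the unfinalized positions within `radius` of any finalized one.

    `finalized` is a set of already-decoded positions. The returned positions
    are the candidates that would be denoised next in this sketch.
    """
    active = set()
    for p in finalized:
        lo = max(0, p - radius)
        hi = min(seq_len, p + radius + 1)
        for q in range(lo, hi):
            if q not in finalized:
                active.add(q)
    return sorted(active)


# Positions 3 and 4 are finalized in a length-8 sequence,
# so the wavefront is their immediate neighbors.
print(wavefront({3, 4}, seq_len=8))  # [2, 5]
```

Unlike a fixed block schedule, the active set here depends on what has already been finalized, so it can grow in either direction around confident regions.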
Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective
Ou, Jingyang, Han, Jiaqi, Xu, Minkai, Xu, Shaoxuan, Xie, Jianwen, Ermon, Stefano, Wu, Yi, Li, Chongxuan
Reinforcement Learning (RL) has proven highly effective for autoregressive language models, but adapting these methods to diffusion large language models (dLLMs) presents fundamental challenges. The core difficulty lies in likelihood approximation: while autoregressive models naturally provide token-level conditional probabilities essential for token-level RL objectives (e.g., GRPO), dLLMs generate sequences through iterative non-autoregressive denoising steps that lack this factorization. To address this fundamental mismatch, we propose ELBO-based Sequence-level Policy Optimization (ESPO), a principled RL framework that treats entire sequence generation as a single action and uses the ELBO as a tractable sequence-level likelihood proxy. Our method incorporates per-token normalization of importance ratios and robust KL-divergence estimation to ensure stable large-scale training. Extensive experiments on mathematical reasoning, coding, and planning tasks demonstrate that ESPO significantly outperforms token-level baselines, achieving dramatic improvements of 20-40 points on the Countdown task, while maintaining consistent gains on math and coding benchmarks. Our approach establishes sequence-level optimization as a principled and empirically effective paradigm for RL in dLLMs. Our code is available at https://github.com/ML-GSAI/ESPO. Large language models (LLMs) (OpenAI, 2023) have become a cornerstone of modern natural language processing, achieving remarkable progress across math (Guo et al., 2025), coding (Hui et al., 2024), and planning tasks (Yao et al., 2023). While autoregressive (AR) modeling has long dominated this field, recent advances in diffusion large language models (dLLMs) have demonstrated strong potential as an alternative formulation (Ou et al., 2024; Shi et al., 2024; Sahoo et al., 2024; Nie et al., 2025; Ye et al., 2025).
With the advent of powerful pretrained dLLMs, the next frontier lies in post-training (Ouyang et al., 2022) to further enhance their capabilities. Among various post-training paradigms, reinforcement learning (RL) has emerged as a powerful approach that enables test-time scaling (Snell et al., 2025) through verifiable rewards (Guo et al., 2025). It has yielded substantial gains on reasoning tasks in recent AR models (OpenAI, 2024), such as math (Cobbe et al., 2021b), coding (Chen et al., 2021), and reasoning (Liu et al., 2023b).
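The sequence-level view in the ESPO abstract, where the ELBO stands in for the sequence log-likelihood and the importance ratio is normalized per token, can be sketched as follows. This is a minimal illustration under assumptions: the normalization is taken to be a division of the ELBO difference by the sequence length, and the PPO-style clipping with `eps` is a common choice that the abstract does not confirm. It is not the code from the linked repository.

```python
import math


def sequence_ratio(elbo_new, elbo_old, num_tokens):
    """Per-token-normalized importance ratio for a whole sequence.

    elbo_new and elbo_old are ELBO estimates (log-likelihood proxies) of the
    same sequence under the current and the behavior policy.
    """
    return math.exp((elbo_new - elbo_old) / num_tokens)


def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate applied once per sequence (assumed form)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)
```

Dividing by the sequence length keeps the ratio in a comparable range for short and long generations, which is one plausible reason such a normalization helps stability.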
Fast-Decoding Diffusion Language Models via Progress-Aware Confidence Schedules
Mohamed, Amr, Zhang, Yang, Vazirgiannis, Michalis, Shang, Guokan
Diffusion large language models (dLLMs) offer a promising alternative to autoregressive models, but their practical utility is severely hampered by slow, iterative sampling. We present SchED, a training-free, model-agnostic early-exit algorithm that aggregates full-span logit margins and halts decoding once a smooth, progress-dependent confidence threshold is met. We evaluated SchED on two dLLM families (Dream and LLaDA), in base and instruction-tuned variants across ten benchmarks spanning downstream tasks including multiple-choice question answering (MCQ), math, long-form QA/summarization, and translation. SchED delivers large, stable accelerations: on instruction-tuned models, it achieves $3.8$-$4.0\times$ speedups while retaining $99.8$-$100\%$ of the baseline score on average. On base models, SchED yields consistent speedup gains with $99.1$-$100\%$ performance retention, with up to $2.34\times$ under more aggressive settings. Using a conservative speed metric that heavily penalizes quality loss (QPS, $\gamma{=}4$), we show that SchED is robust and clearly outperforms prior confidence-based early-exit methods, which break down on long-form generation. An entropy analysis of the model's token predictions reveals that instruction tuning speeds up the decay of predictive entropy. By turning genuine confidence stabilization into computational savings, SchED makes dLLM decoding substantially more efficient.
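The early-exit rule described in the SchED abstract (aggregate logit margins over the full span, then stop once a progress-dependent threshold is met) can be sketched as below. The mean as the aggregator, the linear threshold schedule, and the values of `tau_start` and `tau_end` are all assumptions for illustration; the abstract only states that the threshold is smooth and depends on progress.

```python
def logit_margin(logits):
    """Gap between the largest and second-largest logit at one position."""
    top = sorted(logits, reverse=True)
    return top[0] - top[1]


def threshold(progress, tau_start=8.0, tau_end=2.0):
    """Assumed linear schedule: strict early, more permissive as decoding proceeds."""
    return tau_start + (tau_end - tau_start) * progress


def should_exit(span_logits, step, total_steps):
    """Halt when the mean margin over the span reaches the current threshold."""
    progress = step / total_steps
    mean_margin = sum(logit_margin(l) for l in span_logits) / len(span_logits)
    return mean_margin >= threshold(progress)
```

With these illustrative settings, a span whose mean margin is 4.0 would keep decoding at step 0 of 10 (threshold 8.0) but would exit at step 8 of 10 (threshold 3.2).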
Beyond Confidence: Adaptive and Coherent Decoding for Diffusion Language Models
Chen, Kecheng, Liu, Ziru, Tao, Xijia, Liu, Hui, Fu, Xinyu, Zhang, Suiyun, Tu, Dandan, Kong, Lingpeng, Liu, Rui, Li, Haoliang
Diffusion Language Models (DLMs) have recently achieved significant success due to their any-order generation capabilities. However, existing inference methods typically rely on local, immediate-step metrics such as confidence or entropy, which inherently lack a more reliable perspective. This limitation frequently leads to inconsistent sampling trajectories and suboptimal generation quality. To address this, we propose Coherent Contextual Decoding (CCD), a novel inference framework built upon two core innovations. First, CCD employs a trajectory rectification mechanism that leverages historical context to enhance sequence coherence, enabling the early rejection of suboptimal paths. We demonstrate that this mechanism is theoretically equivalent to modeling the consistency of historical steps via the conditional mutual information between context and token predictions. Building on this theoretical insight, we further address the inefficiency of conventional uniform decoding budgets. Instead of rigid allocations based on diffusion steps, we introduce an adaptive sampling strategy that dynamically adjusts the unmasking budget for each step according to our consistency metric. Consequently, our method significantly improves the quality of generation trajectories while accelerating the sampling process. Empirically, our method achieves a simultaneous enhancement in both inference speed and performance across diverse benchmarks on Dream and LLaDA, delivering up to 3.48x speedup alongside 3.91% performance improvement.
Beyond Static Cutoffs: One-Shot Dynamic Thresholding for Diffusion Language Models
Masked diffusion language models (MDLMs) are becoming competitive with their autoregressive counterparts but typically decode with fixed steps and sequential unmasking. To accelerate decoding, recent work such as Fast-dLLM enables parallel decoding via a static global confidence threshold, yet we observe strong block- and step-wise confidence fluctuations and, within a dataset, near-identical confidence trajectories across inputs as measured by cosine similarity. Motivated by these observations, we introduce One-Shot Dynamic Thresholding (OSDT), which calibrates thresholds on a single sequence and applies them to subsequent inputs with negligible overhead. On GPQA, GSM8K, and HumanEval, OSDT attains superior accuracy-throughput trade-offs (+24% tokens/s on GSM8K at the best accuracy, +45% on GPQA with comparable accuracy, and +50% on HumanEval with a modest accuracy gap). Beyond these results, our findings suggest broader opportunities to leverage reusable task-level confidence signatures for more general-purpose algorithmic and systems innovations in diffusion decoding.
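The calibrate-once, reuse-everywhere idea behind OSDT can be sketched as follows. The details are assumptions: per-step thresholds are taken to be a scaled mean of the confidences observed on the calibration sequence, and a fallback unmasks the single most confident token so that decoding always progresses. The abstract does not specify the calibration rule, so the names and the `scale` parameter are hypothetical.

```python
def calibrate(conf_per_step, scale=0.9):
    """Derive one threshold per decoding step from a single calibration run.

    conf_per_step[s] lists the token confidences observed at step s.
    """
    return [scale * sum(c) / len(c) for c in conf_per_step]


def select_tokens(confidences, step, thresholds):
    """Pick positions to unmask at `step` on a new input using stored thresholds."""
    tau = thresholds[min(step, len(thresholds) - 1)]
    chosen = [i for i, c in enumerate(confidences) if c >= tau]
    if not chosen:
        # Fallback (assumed): always unmask the most confident position.
        chosen = [max(range(len(confidences)), key=lambda i: confidences[i])]
    return chosen
```

Because the thresholds are computed once and then only looked up, the per-input overhead is a list index, which is consistent with the negligible overhead the abstract reports.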
From Bits to Rounds: Parallel Decoding with Exploration for Diffusion Language Models
Fu, Hengyu, Huang, Baihe, Adams, Virginia, Wang, Charles, Srinivasan, Venkat, Jiao, Jiantao
Diffusion Language Models (DLMs) have recently emerged as a strong alternative to autoregressive language models (LMs). DLMs offer comparable accuracy with faster inference speed via parallel decoding. However, standard DLM decoding strategies relying on high-confidence tokens encounter an inherent information-theoretic bottleneck that restricts decoding progress and ultimately slows generation. We demonstrate both theoretically and empirically that prioritizing high-confidence tokens is inherently inefficient. High-probability tokens carry negligible information and strictly relying on them limits the effective progress made in each decoding round. We prove that the number of decoding rounds must grow linearly with the sample's total information (negative log-likelihood) and inversely with the per-round information budget, establishing a bits-to-rounds principle. We also propose Explore-Then-Exploit (ETE), a training-free decoding strategy that maximizes information throughput and decoding efficiency. ETE combines cross-block decoding with targeted exploration of high-uncertainty tokens to reshape the conditional distribution and trigger cascades of confident predictions. Experiments verify our theoretical bounds and demonstrate that ETE consistently reduces the required number of decoding rounds compared to confidence-only baselines without compromising generation quality.
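The bits-to-rounds principle stated above can be illustrated with simple arithmetic: total information is the negative log-likelihood of the sample, and the number of rounds scales with that total divided by the per-round budget. The sketch below uses a proportionality constant of 1 and measures information in bits; both are assumptions, since the abstract gives the scaling but not the exact bound.

```python
import math


def total_information_bits(token_probs):
    """Negative log-likelihood of a sample in bits, given per-token probabilities."""
    return -sum(math.log2(p) for p in token_probs)


def min_rounds(token_probs, bits_per_round):
    """Illustrative lower bound on rounds: total information over per-round budget."""
    return math.ceil(total_information_bits(token_probs) / bits_per_round)


# Three tokens carrying 1, 2, and 3 bits with a budget of 2 bits per round.
print(min_rounds([0.5, 0.25, 0.125], bits_per_round=2))  # 3
```

The example also shows why decoding only high-probability tokens is slow under this view: a token with probability near 1 contributes almost no bits, so a round that commits only such tokens barely reduces the remaining total.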