specdec
R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning
Chen, Zhuokun, Chen, Zeren, He, Jiahao, Sheng, Lu, Tan, Mingkui, Cai, Jianfei, Zhuang, Bohan
Chain-of-thought (CoT) enhances the problem-solving ability of large language models (LLMs) but incurs substantial inference cost due to long autoregressive trajectories. Existing acceleration strategies either shorten traces via early stopping or compression, or adopt speculative decoding with a smaller model. However, speculative decoding provides limited gains when model agreement is low and rigidly enforces token-level consistency, overlooking the observation that some smaller models, when correct, produce significantly more concise reasoning traces that could reduce inference length. We introduce R-Stitch, a training-free hybrid decoding framework that leverages token-level entropy as an uncertainty proxy to delegate computation between a small language model (SLM) and an LLM. Our analysis shows that high-entropy tokens are more likely to induce errors, motivating an entropy-guided routing strategy that lets the SLM efficiently handle low-entropy tokens while delegating uncertain ones to the LLM, thereby avoiding full rollbacks and preserving answer quality. We further extend this design with R-Stitch$^{+}$, which learns an adaptive routing policy to adjust the token budget dynamically beyond fixed thresholds. By jointly reducing per-token decoding complexity and the number of generated tokens, our method achieves substantial acceleration with negligible accuracy loss. Concretely, it attains peak speedups of 3.00$\times$ on DeepSeek-R1-Distill-Qwen-7B, 3.85$\times$ on 14B, and 4.10$\times$ on QWQ-32B while maintaining accuracy comparable to full LLM decoding. Moreover, it naturally enables adaptive efficiency--accuracy trade-offs that can be tailored to diverse computational budgets without retraining.
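The entropy-guided routing idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`token_entropy`, `entropy_routed_decode`), the dictionary representation of a next-token distribution, the greedy token choice, and the fixed `threshold` are all assumptions made for illustration. The sketch only shows the forward direction described in the abstract, where the SLM keeps low-entropy tokens and hands uncertain ones to the LLM.

```python
import math
from typing import Callable, Dict, List, Tuple

Dist = Dict[str, float]                # token -> probability (hypothetical representation)
StepFn = Callable[[List[str]], Dist]   # maps the current token sequence to a next-token distribution


def token_entropy(dist: Dist) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0.0)


def entropy_routed_decode(
    slm_step: StepFn,
    llm_step: StepFn,
    prefix: List[str],
    max_new_tokens: int,
    threshold: float,
    eos: str = "<eos>",
) -> Tuple[List[str], List[str]]:
    """Greedy decoding where each token is produced by the SLM unless its entropy is too high.

    Returns the full token sequence and, for each new token, which model produced it.
    """
    tokens = list(prefix)
    sources: List[str] = []
    for _ in range(max_new_tokens):
        dist = slm_step(tokens)
        source = "slm"
        if token_entropy(dist) > threshold:
            # The SLM is uncertain here, so this single token is delegated to the LLM.
            dist = llm_step(tokens)
            source = "llm"
        token = max(dist, key=dist.get)  # greedy choice (an assumption of this sketch)
        tokens.append(token)
        sources.append(source)
        if token == eos:
            break
    return tokens, sources
```

Because the decision is made per token before anything is committed, no drafted tokens are ever discarded, which is the contrast with rollback-based speculative decoding that the abstract draws. Raising `threshold` sends more tokens to the SLM and trades accuracy for speed.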
Confidence-Modulated Speculative Decoding for Large Language Models
Sen, Jaydip, Dasgupta, Subhasis, Waghela, Hetvi
Speculative decoding has emerged as an effective approach for accelerating autoregressive inference by parallelizing token generation through a draft-then-verify paradigm. However, existing methods rely on static drafting lengths and rigid verification criteria, limiting their adaptability across varying model uncertainties and input complexities. This paper proposes an information-theoretic framework for speculative decoding based on confidence-modulated drafting. By leveraging entropy and margin-based uncertainty measures over the drafter's output distribution, the proposed method dynamically adjusts the number of speculatively generated tokens at each iteration. This adaptive mechanism reduces rollback frequency, improves resource utilization, and maintains output fidelity. Additionally, the verification process is modulated using the same confidence signals, enabling more flexible acceptance of drafted tokens without sacrificing generation quality. Experiments on machine translation and summarization tasks demonstrate significant speedups over standard speculative decoding while preserving or improving BLEU and ROUGE scores. The proposed approach offers a principled, plug-in method for efficient and robust decoding in large language models under varying conditions of uncertainty.
Keywords: Speculative Decoding, Autoregressive Models, Confidence Estimation, Adaptive Inference, Entropy-Based Drafting, Sequence Generation, Large Language Models (LLMs), Information-Theoretic Decoding.
The task of sequence generation lies at the heart of numerous applications in natural language processing, including machine translation, text summarization, dialogue generation, and code synthesis. In the overwhelming majority of these applications, autoregressive (AR) decoding remains the dominant paradigm for generating sequences from a probabilistic language model [1-2].
Autoregressive models, particularly those based on the Transformer architecture, operate by predicting each token conditioned on the entire history of previously generated tokens. This left-to-right decoding strategy, though optimal in terms of likelihood estimation, suffers from a fundamental limitation: the inherently sequential nature of generation prohibits efficient parallelization, severely hindering inference throughput, especially in latency-sensitive deployment scenarios.
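One way to picture confidence-modulated drafting is a function that maps the drafter's next-token distribution to a draft length. The sketch below is an illustration under stated assumptions, not the paper's formula: the equal-weight blend of normalized entropy and top-2 margin, the linear mapping to a length, and the names `normalized_entropy`, `top2_margin`, `draft_length`, `k_min`, `k_max`, and `entropy_weight` are all hypothetical.

```python
import math
from typing import Sequence


def normalized_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy divided by its maximum, log(vocabulary size), so the result lies in [0, 1]."""
    if len(probs) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def top2_margin(probs: Sequence[float]) -> float:
    """Gap between the largest and second-largest probabilities."""
    top = sorted(probs, reverse=True)[:2]
    return top[0] - (top[1] if len(top) > 1 else 0.0)


def draft_length(
    probs: Sequence[float],
    k_min: int = 1,
    k_max: int = 8,
    entropy_weight: float = 0.5,
) -> int:
    """Map drafter confidence to a number of tokens to draft (hypothetical linear rule)."""
    confidence = (
        entropy_weight * (1.0 - normalized_entropy(probs))
        + (1.0 - entropy_weight) * top2_margin(probs)
    )
    confidence = min(1.0, max(0.0, confidence))  # guard against floating-point drift
    return k_min + round(confidence * (k_max - k_min))
```

A sharply peaked distribution yields a long draft and a flat one yields a short draft, which is the mechanism the abstract credits with reducing rollback frequency.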
Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding
Xia, Heming, Yang, Zhe, Dong, Qingxiu, Wang, Peiyi, Li, Yongqi, Ge, Tao, Liu, Tianyu, Li, Wenjie, Sui, Zhifang
To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first efficiently drafts several future tokens and then verifies them in parallel. Unlike autoregressive decoding, Speculative Decoding facilitates the simultaneous decoding of multiple tokens per step, thereby accelerating inference. This paper presents a comprehensive overview and analysis of this promising decoding paradigm. We begin by providing a formal definition and formulation of Speculative Decoding. Then, we organize in-depth discussions on its key facets, including current leading techniques, the challenges faced, and potential future directions in this field. We aim for this work to serve as a catalyst for further research on Speculative Decoding, ultimately contributing to more efficient LLM inference.
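The draft-then-verify step described above can be sketched as follows, under simplifying assumptions: both models decode greedily, a model is represented by a hypothetical callable that returns the next token id for a given context, and verification is simulated with one target call per position. In a real system the target model would score all drafted positions in a single parallel forward pass; the function name `speculative_step` is illustrative only.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # context -> greedy next token id (hypothetical interface)


def speculative_step(
    draft_next: NextToken,
    target_next: NextToken,
    tokens: List[int],
    k: int,
) -> List[int]:
    """Run one draft-then-verify step and return the tokens to append to `tokens`."""
    # Draft: the cheap model proposes k tokens autoregressively.
    drafted: List[int] = []
    ctx = list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)

    # Verify: keep the longest prefix on which the target model agrees.
    accepted: List[int] = []
    ctx = list(tokens)
    for t in drafted:
        expected = target_next(ctx)
        if expected != t:
            # First mismatch: substitute the target's token and discard the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(t)
        ctx.append(t)

    # Every drafted token matched, so the target's own next token comes for free.
    accepted.append(target_next(ctx))
    return accepted
```

Each step appends between 1 and k + 1 tokens, and under this greedy matching rule the output is identical to what the target model would have produced on its own. The speedup therefore depends on how often the drafter agrees with the target.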
Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation
Xia, Heming, Ge, Tao, Wang, Peiyi, Chen, Si-Qing, Wei, Furu, Sui, Zhifang
We propose Speculative Decoding (SpecDec) to formally study, for the first time, how the idea of speculative execution can be exploited to accelerate autoregressive (AR) decoding. Speculative Decoding has two innovations: Spec-Drafter, an independent model specially optimized for efficient and accurate drafting, and Spec-Verification, a reliable method for efficiently verifying the drafted tokens in the decoding paradigm. Experimental results on various seq2seq tasks, including machine translation and abstractive summarization, show that our approach can achieve around $5\times$ speedup for the popular Transformer architectures with generation quality comparable to beam search decoding, revising the impression that the draft-then-verify paradigm yields only $1.4\times$$\sim$$2\times$ speedup. In addition to the remarkable speedup, we also demonstrate three additional advantages of SpecDec, revealing its practical value for accelerating generative models in real-world applications. Our models and code are available at https://github.com/hemingkx/SpecDec.