AITopics | specdec

specdec

R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning

Chen, Zhuokun, Chen, Zeren, He, Jiahao, Sheng, Lu, Tan, Mingkui, Cai, Jianfei, Zhuang, Bohan

arXiv.org Artificial Intelligence, Sep-29-2025

Chain-of-thought (CoT) enhances the problem-solving ability of large language models (LLMs) but incurs substantial inference cost due to long autoregressive trajectories. Existing acceleration strategies either shorten traces via early stopping or compression, or adopt speculative decoding with a smaller model. However, speculative decoding provides limited gains when model agreement is low and rigidly enforces token-level consistency, overlooking the observation that some smaller models, when correct, produce significantly more concise reasoning traces that could reduce inference length. We introduce R-Stitch, a training-free hybrid decoding framework that leverages token-level entropy as an uncertainty proxy to delegate computation between a small language model (SLM) and an LLM. Our analysis shows that high-entropy tokens are more likely to induce errors, motivating an entropy-guided routing strategy that lets the SLM efficiently handle low-entropy tokens while delegating uncertain ones to the LLM, thereby avoiding full rollbacks and preserving answer quality. We further extend this design with R-Stitch$^{+}$, which learns an adaptive routing policy to adjust the token budget dynamically beyond fixed thresholds. By jointly reducing per-token decoding complexity and the number of generated tokens, our method achieves substantial acceleration with negligible accuracy loss. Concretely, it attains peak speedups of 3.00$\times$ on DeepSeek-R1-Distill-Qwen-7B, 3.85$\times$ on 14B, and 4.10$\times$ on QWQ-32B while maintaining accuracy comparable to full LLM decoding. Moreover, it naturally enables adaptive efficiency-accuracy trade-offs that can be tailored to diverse computational budgets without retraining.
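The entropy-guided routing idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `token_entropy` and `route`, the threshold value 0.5, and the toy distributions are all assumptions made for illustration. It only shows the one decision the abstract describes, keeping a low-entropy SLM token and delegating a high-entropy one to the LLM.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route(slm_probs: Sequence[float], threshold: float) -> str:
    """Hypothetical router: keep the SLM token when its distribution is
    low-entropy, otherwise hand the position over to the LLM."""
    return "SLM" if token_entropy(slm_probs) <= threshold else "LLM"


# Toy next-token distributions over a 4-token vocabulary (illustrative only).
confident = [0.97, 0.01, 0.01, 0.01]   # sharply peaked, low entropy
uncertain = [0.25, 0.25, 0.25, 0.25]   # uniform, maximum entropy

THRESHOLD = 0.5  # assumed value, not taken from the paper

for name, dist in [("confident", confident), ("uncertain", uncertain)]:
    print(name, round(token_entropy(dist), 3), route(dist, THRESHOLD))
```

A fixed threshold like this corresponds to the basic setting described in the abstract; the learned policy of R-Stitch$^{+}$ would replace the constant comparison and is not sketched here.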

large language model, natural language, r-stitch 0, (16 more...)

arXiv.org Artificial Intelligence

2507.17307

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Confidence-Modulated Speculative Decoding for Large Language Models

Sen, Jaydip, Dasgupta, Subhasis, Waghela, Hetvi

arXiv.org Artificial Intelligence, Aug-26-2025

Speculative decoding has emerged as an effective approach for accelerating autoregressive inference by parallelizing token generation through a draft-then-verify paradigm. However, existing methods rely on static drafting lengths and rigid verification criteria, limiting their adaptability across varying model uncertainties and input complexities. This paper proposes an information-theoretic framework for speculative decoding based on confidence-modulated drafting. By leveraging entropy and margin-based uncertainty measures over the drafter's output distribution, the proposed method dynamically adjusts the number of speculatively generated tokens at each iteration. This adaptive mechanism reduces rollback frequency, improves resource utilization, and maintains output fidelity. Additionally, the verification process is modulated using the same confidence signals, enabling more flexible acceptance of drafted tokens without sacrificing generation quality. Experiments on machine translation and summarization tasks demonstrate significant speedups over standard speculative decoding while preserving or improving BLEU and ROUGE scores. The proposed approach offers a principled, plug-in method for efficient and robust decoding in large language models under varying conditions of uncertainty.

Keywords: Speculative Decoding, Autoregressive Models, Confidence Estimation, Adaptive Inference, Entropy-Based Drafting, Sequence Generation, Large Language Models (LLMs), Information-Theoretic Decoding.

The task of sequence generation lies at the heart of numerous applications in natural language processing, including machine translation, text summarization, dialogue generation, and code synthesis. In the overwhelming majority of these applications, autoregressive (AR) decoding remains the dominant paradigm for generating sequences from a probabilistic language model [1-2]. Autoregressive models, particularly those based on the Transformer architecture, operate by predicting each token conditioned on the entire history of previously generated tokens. This left-to-right decoding strategy, though optimal in terms of likelihood estimation, suffers from a fundamental limitation: the inherently sequential nature of generation prohibits efficient parallelization, severely hindering inference throughput, especially in latency-sensitive deployment scenarios.
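As a rough illustration of confidence-modulated drafting, the sketch below maps entropy and top-2 margin of a drafter distribution to a draft length. The equal-weight combination, the bounds `k_min` and `k_max`, and all function names are assumptions chosen for illustration; the listing above does not give the paper's actual formula.

```python
import math
from typing import Sequence


def normalized_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy divided by log(len(probs)): about 0 for a one-hot
    distribution and about 1 for a uniform one. Needs at least 2 entries."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def top2_margin(probs: Sequence[float]) -> float:
    """Gap between the most likely and second most likely token."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return top1 - top2


def draft_length(probs: Sequence[float], k_min: int = 1, k_max: int = 8) -> int:
    """Hypothetical rule: draft more tokens when the drafter is confident.

    Confidence is an assumed equal-weight mix of (1 - normalized entropy)
    and the top-2 margin, clamped to [0, 1] to absorb rounding error.
    """
    confidence = 0.5 * (1.0 - normalized_entropy(probs)) + 0.5 * top2_margin(probs)
    confidence = min(1.0, max(0.0, confidence))
    return k_min + round(confidence * (k_max - k_min))


if __name__ == "__main__":
    for dist in ([0.9, 0.05, 0.03, 0.02], [0.4, 0.3, 0.2, 0.1]):
        print(dist, "->", draft_length(dist))
```

Under this rule a sharply peaked drafter distribution yields a long draft and a flat one yields a short draft, which is the qualitative behavior the abstract attributes to the adaptive mechanism.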

computational linguistic, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2508.15371

Country: Asia > India (0.68)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.66)

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

Xia, Heming, Yang, Zhe, Dong, Qingxiu, Wang, Peiyi, Li, Yongqi, Ge, Tao, Liu, Tianyu, Li, Wenjie, Sui, Zhifang

arXiv.org Artificial Intelligence, Jan-15-2024

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first efficiently drafts several future tokens and then verifies them in parallel. Unlike autoregressive decoding, Speculative Decoding facilitates the simultaneous decoding of multiple tokens per step, thereby accelerating inference. This paper presents a comprehensive overview and analysis of this promising decoding paradigm. We begin by providing a formal definition and formulation of Speculative Decoding. Then, we organize in-depth discussions on its key facets, including current leading techniques, the challenges faced, and potential future directions in this field. We aim for this work to serve as a catalyst for further research on Speculative Decoding, ultimately contributing to more efficient LLM inference.
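The draft-then-verify step described above can be sketched with toy models. This is a simplified illustration, not the survey's formal definition: it assumes greedy decoding with exact-match verification, and the names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical. A real system would verify all drafted positions in one parallel forward pass of the target model; the loop here checks them one by one for clarity.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Draft k tokens with the cheap model, then keep the longest prefix
    the target model agrees with, plus one token from the target."""
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft_next(prefix + drafted))

    accepted: List[int] = []
    for tok in drafted:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: use the target's token
            return accepted
        accepted.append(tok)
    # Every drafted token matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prefix: List[int], draft_next: NextToken,
             target_next: NextToken, k: int, max_new: int) -> List[int]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prefix) + max_new]


def toy_target(seq: List[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    """Toy draft model: agrees with the target except right after token 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    print(generate([0], toy_draft, toy_target, k=3, max_new=12))
```

With exact-match verification the output is identical to decoding with the target model alone; the speedup comes from accepting several tokens per target-model call whenever the draft agrees.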

inference, speculative decoding, target llm, (12 more...)

arXiv.org Artificial Intelligence

2401.07851

Country:

North America > Canada > Ontario > Toronto (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(3 more...)

Genre: Overview (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation

Xia, Heming, Ge, Tao, Wang, Peiyi, Chen, Si-Qing, Wei, Furu, Sui, Zhifang

arXiv.org Artificial Intelligence, Oct-29-2023

We propose Speculative Decoding (SpecDec), the first formal study of exploiting the idea of speculative execution to accelerate autoregressive (AR) decoding. Speculative Decoding has two innovations: Spec-Drafter, an independent model specially optimized for efficient and accurate drafting, and Spec-Verification, a reliable method for verifying the drafted tokens efficiently in the decoding paradigm. Experimental results on various seq2seq tasks, including machine translation and abstractive summarization, show that our approach can achieve around $5\times$ speedup for the popular Transformer architectures with generation quality comparable to beam search decoding, challenging the impression that the draft-then-verify paradigm yields only $1.4\times \sim 2\times$ speedup. In addition to the remarkable speedup, we also demonstrate 3 additional advantages of SpecDec, revealing its practical value for accelerating generative models in real-world applications. Our models and code are available at https://github.com/hemingkx/SpecDec.

decoding, spec-drafter, specdec, (15 more...)

arXiv.org Artificial Intelligence

2203.16487

Country:

Europe > Belgium (0.04)
North America > Canada > Quebec > Montreal (0.04)
North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
(11 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.66)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Information Technology > Artificial Intelligence > Natural Language > Generation (0.48)
