Training-Free Loosely Speculative Decoding: Accepting Semantically Correct Drafts Beyond Exact Match
Li, Jinze, Xu, Yixing, Li, Guanchen, Yang, Shuo, Xu, Jinfeng, Yin, Xuanwu, Li, Dong, Ngai, Edith C. H., Barsoum, Emad
Large language models (LLMs) achieve strong performance across diverse tasks but suffer from high inference latency due to their autoregressive generation. Speculative Decoding (SPD) mitigates this issue by verifying, in parallel, candidate tokens proposed by a smaller draft model, yet its strict exact-match verification discards many semantically valid continuations. Moreover, existing training-based SPD methods often suffer from performance degradation on out-of-distribution (OOD) tasks. To this end, we propose Training-Free Loosely Speculative Decoding (FLy), a novel method that loosens the rigid verification criterion by leveraging the target model's self-corrective behavior to judge whether a draft-target mismatch remains semantically valid. FLy introduces a two-tier mechanism: an entropy-level gate that identifies whether the current token admits multiple plausible alternatives or is nearly deterministic, and a token-level deferred window that distinguishes genuine errors from differently worded yet semantically correct variants. To further reduce latency, we design a multi-level acceleration strategy that accelerates not only the target model but also the drafter itself. Owing to its training-free design, FLy composes seamlessly with arbitrary draft-target pairs and generalizes across models and domains without hyperparameter re-tuning. Experiments show that FLy preserves more than 99% of the target model's accuracy while achieving an average 2.81x speedup on Llama-3.1-70B-Instruct and a 5.07x speedup on the 405B variant. Notably, on out-of-domain datasets, our method remains highly effective and outperforms the training-based method EAGLE-3 by 1.62x.
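The two-tier acceptance rule is concrete enough to sketch. Below is a minimal, hypothetical Python rendering of the idea, not the authors' implementation; the function names, the entropy threshold, and the way window agreement is summarized are all assumptions for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def loosely_verify(draft_token, target_probs, window_agreement,
                   entropy_thresh=2.0):
    """Hypothetical sketch of FLy-style loose verification.

    draft_token:      token id proposed by the drafter
    target_probs:     target model's distribution at this position
    window_agreement: fraction of a short deferred window over which the
                      draft and target continuations agree
    """
    target_top = max(range(len(target_probs)), key=target_probs.__getitem__)
    if draft_token == target_top:
        return True  # exact match: standard speculative acceptance
    # Tier 1: entropy gate -- consider loose acceptance only when the
    # target is uncertain, i.e. several continuations are plausible.
    if entropy(target_probs) < entropy_thresh:
        return False  # nearly deterministic target: treat mismatch as error
    # Tier 2: deferred window -- accept a mismatch only if the draft's
    # continuation re-converges with the target's (self-corrective check).
    return window_agreement >= 0.5
```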
HiViS: Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models
Xie, Zhinan, Wang, Peisong, Qiu, Shuang, Cheng, Jian
Speculative decoding has proven effective for accelerating inference in Large Language Models (LLMs), yet its extension to Vision-Language Models (VLMs) remains limited by the computational burden and semantic inconsistency introduced by visual tokens. Recent studies reveal that visual tokens in large VLMs are highly redundant, and most of them can be removed without compromising generation quality. Motivated by this observation, we propose HiViS (Hiding Visual Tokens from the Drafter for Speculative Decoding in Vision-Language Models), a framework that uses the target VLM as a semantic fusion model, allowing the drafter to obtain visual information without explicitly processing visual tokens and ensuring that the drafter's prefill sequence length matches that of the textual tokens. Furthermore, HiViS employs a time-step-aware aligned training scheme that allows the drafter to autonomously propagate and refine instructive visual-textual semantics during independent drafting, guided by step-dependent bias-correction residuals. Extensive experiments across representative VLMs and benchmarks demonstrate that HiViS achieves significant improvements in average acceptance length and speedup ratio.
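The load-bearing property is that the drafter's prefill length equals the number of textual tokens, with visual information arriving only through the target VLM's already vision-fused hidden states. A hedged sketch of that interface, assuming matching hidden sizes; every module name below is invented for illustration and is not HiViS's actual architecture.

```python
import torch
import torch.nn as nn

class HiddenFusionDrafter(nn.Module):
    """Hypothetical drafter that never sees visual tokens directly.

    The target VLM acts as the semantic fusion model: its hidden states at
    the text positions already mix in visual context, so the drafter
    conditions on (text embeddings + projected target hiddens) and its
    prefill length equals the number of textual tokens.
    """
    def __init__(self, vocab_size=32000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.fuse = nn.Linear(dim, dim)  # projects target hidden states
        self.block = nn.TransformerEncoderLayer(dim, nhead=8,
                                                batch_first=True)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, text_ids, target_text_hiddens):
        # target_text_hiddens: target VLM hiddens at text positions,
        # shape (batch, n_text_tokens, dim) -- no visual tokens appear.
        x = self.embed(text_ids) + self.fuse(target_text_hiddens)
        return self.lm_head(self.block(x))
```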
Beat the long tail: Distribution-Aware Speculative Decoding for RL Training
Shao, Zelei, Srivatsa, Vikranth, Srivastava, Sanjana, Wu, Qingyang, Ariyak, Alpay, Wu, Xiaoxia, Patel, Ameen, Wang, Jue, Liang, Percy, Dao, Tri, Zhang, Ce, Zhang, Yiying, Athiwaratkun, Ben, Xu, Chenfeng, Wang, Junxiong
Reinforcement learning (RL) post-training has become essential for aligning large language models (LLMs), yet its efficiency is increasingly constrained by the rollout phase, where long trajectories are generated token by token. We identify a major bottleneck, the long-tail distribution of rollout lengths, where a small fraction of long generations dominates wall-clock time, and a complementary opportunity: the availability of historical rollouts that reveal stable prompt-level patterns across training epochs. Motivated by these observations, we propose DAS, a Distribution-Aware Speculative decoding framework that accelerates RL rollouts without altering model outputs. DAS integrates two key ideas: an adaptive, nonparametric drafter built from recent rollouts using an incrementally maintained suffix tree, and a length-aware speculation policy that allocates more aggressive draft budgets to the long trajectories that dominate makespan. This design exploits rollout history to sustain acceptance while balancing base- and token-level costs during decoding. Experiments on math and code reasoning tasks show that DAS reduces rollout time by up to 50% while preserving identical training curves, demonstrating that distribution-aware speculative decoding can significantly accelerate RL post-training without compromising learning quality.
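The nonparametric drafter is the most mechanical piece, so a simplified stand-in is sketched below: an n-gram continuation table over recent rollouts in place of the paper's incrementally maintained suffix tree, plus a toy length-aware budget rule. All names and thresholds are assumptions.

```python
from collections import defaultdict

class HistoryDrafter:
    """Simplified stand-in for a suffix-tree drafter: propose whatever
    continuation most often followed the current context suffix in
    recently observed rollouts."""

    def __init__(self, ctx_len=4):
        self.ctx_len = ctx_len
        self.table = defaultdict(lambda: defaultdict(int))

    def observe(self, rollout):
        """Incrementally ingest a finished rollout (list of token ids)."""
        for i in range(len(rollout) - self.ctx_len):
            ctx = tuple(rollout[i:i + self.ctx_len])
            self.table[ctx][rollout[i + self.ctx_len]] += 1

    def draft(self, context, budget):
        """Greedily extend `context` by up to `budget` tokens from history."""
        out = []
        ctx = tuple(context[-self.ctx_len:])
        for _ in range(budget):
            nexts = self.table.get(ctx)
            if not nexts:
                break  # context never seen: stop drafting early
            tok = max(nexts, key=nexts.get)
            out.append(tok)
            ctx = ctx[1:] + (tok,)
        return out

def draft_budget(tokens_so_far, base=4, long_thresh=2048, boost=4):
    """Toy length-aware policy: speculate more aggressively on the long
    trajectories that dominate makespan."""
    return base + (boost if tokens_so_far > long_thresh else 0)
```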
Steering Pretrained Drafters during Speculative Decoding
Berdoz, Frédéric, Rheinboldt, Peer, Wattenhofer, Roger
Speculative decoding accelerates language model inference by separating generation into fast drafting and parallel verification. Its main limitation is drafter-verifier misalignment, which lowers token acceptance and reduces overall effectiveness. While small drafting heads trained from scratch compensate with speed, they struggle when verification dominates latency or when inputs are out of distribution. In contrast, pretrained drafters, though slower, achieve higher acceptance rates thanks to stronger standalone generation capabilities, making them competitive when drafting latency is negligible relative to verification or communication overhead. In this work, we aim to improve the acceptance rates of pretrained drafters by introducing a lightweight dynamic alignment mechanism: a steering vector computed from the verifier's hidden states and injected into the pretrained drafter. Compared to existing offline alignment methods such as distillation, our approach boosts the number of accepted tokens by up to 35% under standard sampling and 22% under greedy sampling, all while incurring negligible computational overhead. Importantly, our approach can be retrofitted to existing architectures and pretrained models, enabling rapid adoption.
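Operationally, the mechanism is a small residual edit on the drafter's hidden states. A minimal sketch, assuming the steering vector is a learned linear map of the verifier's last hidden state; the projection and the injection site are assumptions, not necessarily the paper's exact design.

```python
import torch
import torch.nn as nn

class SteeredDrafter(nn.Module):
    """Wraps a pretrained drafter and injects a steering vector derived
    from the verifier's hidden state into the drafter's residual stream."""

    def __init__(self, drafter_dim, verifier_dim):
        super().__init__()
        # Lightweight alignment map: verifier hidden -> drafter hidden.
        # Only this map is trained; the pretrained drafter stays frozen.
        self.steer = nn.Linear(verifier_dim, drafter_dim, bias=False)

    def steering_vector(self, verifier_hidden):
        # verifier_hidden: (batch, verifier_dim), e.g. the verifier's
        # last-layer state at the last verified token.
        return self.steer(verifier_hidden)

    def inject(self, drafter_hidden, verifier_hidden, scale=1.0):
        # Additive injection into the drafter's hidden state before its
        # output head; `scale` trades alignment strength against drift.
        return drafter_hidden + scale * self.steering_vector(verifier_hidden)
```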
DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models
Zhang, Jinbin, Ullah, Nasib, Schultheis, Erik, Babbar, Rohit
Speculative decoding has become a standard way to accelerate LLM inference: a small drafter proposes multiple tokens and a large target model verifies them once per speculation length. Recently, LLM vocabularies have grown substantially. While verification over the full vocabulary leaves the target model largely unaffected, the O(|V|d) parameters in the drafter's output head become a latency bottleneck, slowing the entire pipeline. Contemporary methods (e.g., FR-Spec, VocabTrim) restrict the drafter's vocabulary to a fixed subset of the target model's most frequent tokens. Although this reduces draft-time compute, it is brittle, since (i) frequency lists are corpus-dependent and require retuning to generalize, and (ii) static shortlists suppress rare or domain-specific tokens, lowering the expected number of tokens accepted per verification step. We propose DynaSpec, a context-dependent dynamic shortlisting mechanism that is robust, speeds up drafting, and generalizes across diverse tasks. Concretely, we introduce lightweight, coarse-grained meta-classifiers that route contexts to a small number of token clusters; the union of the top-k selected clusters forms the drafter's shortlist, while verification retains the full vocabulary and remains exact. The meta-classifier finishes its computation before the drafter's hidden-state generation by running draft encoding and meta-shortlisting in parallel on separate streams. Across standard speculative decoding benchmarks, DynaSpec delivers consistent improvements in mean accepted length; for Llama-3-8B it reaches up to 98.2% of full-vocabulary performance, while fixed-shortlist baselines attain only 84.4%. By leveraging context-dependent selection, DynaSpec achieves up to a 2.18x increase in generated tokens, compared to 1.91x for fixed-vocabulary approaches.
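The routing step is easy to make concrete. A hedged sketch, assuming token ids were assigned to clusters offline and a tiny classifier scores clusters from the current context hidden state; all shapes and names are illustrative, not DynaSpec's actual code.

```python
import torch
import torch.nn as nn

class ClusterRouter(nn.Module):
    """Coarse meta-classifier: score token clusters from the context and
    shortlist the union of the top-k clusters for the drafter's head."""

    def __init__(self, dim, n_clusters, token_to_cluster):
        super().__init__()
        self.scorer = nn.Linear(dim, n_clusters)
        # token_to_cluster: LongTensor of shape (|V|,) mapping each token
        # id to its (offline-computed) cluster id.
        self.register_buffer("token_to_cluster", token_to_cluster)

    def shortlist(self, context_hidden, k=8):
        scores = self.scorer(context_hidden)           # (n_clusters,)
        top = torch.topk(scores, k).indices            # chosen clusters
        mask = torch.isin(self.token_to_cluster, top)  # (|V|,) bool
        return mask.nonzero(as_tuple=True)[0]          # shortlist token ids

# The drafter then computes logits only over the shortlist (an O(|S|d)
# matmul instead of O(|V|d)), while verification keeps the full
# vocabulary, so the accept/reject step remains exact.
```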
SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding
Sandler, Jameson, Christopher, Jacob K., Hartvigsen, Thomas, Fioretto, Ferdinando
Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive speed-ups. Yet current speculative decoding approaches remain limited by two fundamental bottlenecks: (1) the autoregressive dependency during drafting, which limits parallelism, and (2) frequent rejections of draft tokens caused by misalignment between the draft and verify models. This paper proposes SpecDiff-2, a novel framework that jointly addresses these two bottlenecks. It leverages discrete diffusion as a non-autoregressive drafter to address bottleneck (1) and develops novel techniques to calibrate discrete diffusion drafters with autoregressive verifiers, addressing bottleneck (2). Experimental results across a comprehensive benchmark suite show that SpecDiff-2 achieves a new state of the art across reasoning, coding, and mathematical benchmarks, improving tokens per second by an average of up to 55% over previous baselines and obtaining up to a 5.5x average speed-up over standard decoding, without any loss of accuracy.
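The control flow behind bottleneck (1) is visible in a generic draft-then-verify loop: a diffusion drafter refines an entire block in a few parallel denoising steps instead of one token per step. The sketch below is schematic; `denoise_step` stands in for an arbitrary discrete-diffusion drafter and is not SpecDiff-2's model.

```python
def diffusion_draft(context, block_len, n_steps, denoise_step, mask_id=0):
    """Non-autoregressive drafting: refine a whole block in parallel.

    denoise_step(context, block) -> block: one discrete-diffusion update
    that re-predicts all masked/low-confidence positions at once.
    """
    block = [mask_id] * block_len        # start from a fully masked block
    for _ in range(n_steps):             # n_steps << block_len
        block = denoise_step(context, block)
    return block

def speculative_step(context, target_greedy, block_len, n_steps,
                     denoise_step):
    """Lossless greedy verification of a drafted block: keep the longest
    prefix the target agrees with, then append the target's own token."""
    draft = diffusion_draft(context, block_len, n_steps, denoise_step)
    # One parallel target pass: target_greedy returns, for each position,
    # the target's greedy token given context + accepted draft prefix.
    verified = target_greedy(context, draft)
    n = 0
    while n < len(draft) and draft[n] == verified[n]:
        n += 1
    # Accept the agreeing prefix plus the target's correction token.
    return draft[:n] + ([verified[n]] if n < len(verified) else [])
```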
When, What, and How: Rethinking Retrieval-Enhanced Speculative Decoding
Fang, Min, Fu, Zhihui, Zhao, Qibin, Wang, Jun
Speculative decoding (SD) has emerged as an effective technique to accelerate large language model (LLM) inference without compromising output quality. However, the achievable speedup largely depends on the effectiveness of the drafting model. While model-based methods like EAGLE-2 are accurate but costly, retrieval-enhanced methods like SAM-Decoding rely on heuristic switching strategies that often trigger unnecessary retrievals. To address this, we propose ReSpec (Retrieval-enhanced Speculative Decoding), a novel framework that transforms heuristic drafter switching into adaptive decision-making. ReSpec features three core innovations: 1) An entropy-guided adaptive trigger quantifies contextual predictability and initiates retrieval only when uncertainty is low, avoiding costly low-quality speculations. 2) A feedback-driven candidate selection leverages historical feedback to organize multiple high-quality candidates for parallel verification, maximizing retrieval utility. 3) A source-aware relaxed verification strategy applies strict checks to model-generated drafts while using relaxed verification for retrieved drafts, achieving a better balance between accuracy and efficiency. Extensive experiments on Spec-Bench demonstrate that ReSpec achieves state-of-the-art acceleration, outperforming EAGLE-2 and SAM-Decoding by over 33% and 25%, respectively, while maintaining output quality.
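The first and third innovations have simple operational forms: retrieval fires only when the next-token distribution is low-entropy (highly predictable), and retrieved drafts are checked against a relaxed criterion while model drafts face a strict one. A minimal sketch, with the threshold, the top-p relaxation, and the function names all being assumptions rather than the paper's definitions.

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_retrieve(next_token_probs, tau=1.0):
    """Entropy-guided trigger: retrieve only in predictable contexts."""
    return entropy(next_token_probs) < tau

def verify(draft_token, target_probs, source, top_p=0.9):
    """Source-aware verification: strict greedy check for model drafts,
    relaxed (top-p membership) check for retrieved drafts."""
    ranked = sorted(range(len(target_probs)),
                    key=target_probs.__getitem__, reverse=True)
    if source == "model":
        return draft_token == ranked[0]   # strict: must match greedy token
    cum, nucleus = 0.0, set()
    for t in ranked:                      # build the target's top-p nucleus
        nucleus.add(t)
        cum += target_probs[t]
        if cum >= top_p:
            break
    return draft_token in nucleus         # relaxed: any nucleus token passes
```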
ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems
Chen, Qiaoling, Liu, Zijun, Sun, Peng, Li, Shenggui, Wang, Guoteng, Liu, Ziming, Wen, Yonggang, Feng, Siyuan, Zhang, Tianwei
We study speculative decoding (SD) (Leviathan et al., 2023; Chen et al., 2023) as an acceleration for reinforcement learning (RL) training of LLMs, where generation is consistently the dominant bottleneck (Zhong et al., 2025a): classic policy optimization methods such as PPO (Schulman et al., 2015; 2017) require long token-by-token rollouts to compute trajectory-level rewards. Although SD is widely adopted in LLM serving systems (e.g., SGLang (Zheng et al., 2024)), with EAGLE-3 (Li et al., 2025) representing the current state of the art, we identify three critical gaps that hinder its naïve integration into RL systems: (G1) diminishing speedups at large batch sizes, so that a single static SD configuration cannot provide reliable speedups across diverse RL workloads; (G2) drafter staleness under continual actor updates, since the actor (i.e., target) model evolves with each policy update and a fixed drafter rapidly becomes misaligned with it; and (G3) drafter-induced policy degradation, as trajectory variance increases with drafter staleness, producing a higher ratio of impoverished trajectories. We propose ReSpec, a system that adapts SD to RL training to address these gaps.
Fast Inference via Hierarchical Speculative Decoding
Mohri, Clara, Kaplan, Haim, Schuster, Tal, Mansour, Yishay, Globerson, Amir
Transformer language models generate text autoregressively, making inference latency proportional to the number of tokens generated. Speculative decoding reduces this latency without sacrificing output quality by leveraging a small draft model to propose tokens that the larger target model verifies in parallel. In practice, however, there may exist a set of potential draft models, ranging from faster but less accurate to slower yet more reliable. We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks these draft models into a hierarchy, where each model proposes tokens and the next larger model verifies them in a single forward pass, until the target model finally verifies the tokens. We derive an expression for the expected latency of any such hierarchy and show that selecting the latency-optimal hierarchy can be done in polynomial time. Empirically, HSD gives up to a 1.2x speed-up over the best single-draft baseline, demonstrating the practicality of our algorithm in reducing generation latency beyond previous techniques.
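The polynomial-time claim is natural once a hierarchy is a chain: any valid hierarchy is a subsequence of the models ordered by speed that ends at the target, so a shortest-path dynamic program over model pairs suffices. In the sketch below, `edge_cost(i, j)` is an invented stand-in for the paper's expected-latency expression; only the DP structure is the point.

```python
def optimal_hierarchy(n_models, edge_cost):
    """Pick the latency-minimal chain 0 -> ... -> n_models-1 (the target).

    edge_cost(i, j): stand-in for the expected latency contribution of
    model j verifying model i's drafts (in the paper this would follow
    from per-token costs and acceptance rates). For simplicity model 0
    is always the base drafter; allowing any base model only requires
    adding one virtual source node. Runs in O(n^2), i.e. polynomial time.
    """
    INF = float("inf")
    best = [INF] * n_models   # best[j]: cheapest chain from model 0 to j
    prev = [None] * n_models
    best[0] = 0.0
    for j in range(1, n_models):
        for i in range(j):
            cost = best[i] + edge_cost(i, j)
            if cost < best[j]:
                best[j], prev[j] = cost, i
    # Reconstruct the optimal chain by walking the predecessor links.
    chain, j = [], n_models - 1
    while j is not None:
        chain.append(j)
        j = prev[j]
    return chain[::-1], best[-1]
```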
Not-a-Bandit: Provably No-Regret Drafter Selection in Speculative Decoding for LLMs
Liu, Hongyi, Huang, Jiaji, Jia, Zhen, Park, Youngsuk, Wang, Yu-Xiang
Speculative decoding is widely used to accelerate large language model (LLM) inference. In this work, we focus on the online draft-model selection problem in speculative decoding. We design an algorithm that provably competes with the best draft model in hindsight for each query, in terms of either the token acceptance probability or the expected acceptance length. In particular, we show that we can accurately evaluate all draft models, not only the chosen one, without incurring additional queries to the target model, which allows us to improve exponentially over the existing bandit-based approach as the number of draft models increases. Our approach is generically applicable to any speculative decoding method (single-draft, multi-draft, and draft-tree). Moreover, we design system-efficient versions of the online learners and demonstrate that the overhead in computation and latency can be substantially reduced. We conduct extensive experiments on open-source LLMs and diverse datasets, demonstrating that our methods substantially outperform the state-of-the-art EAGLE-3 and the BanditSpec baseline across a variety of domains where specialized domain-expert drafters are available, especially when long reasoning chains are required.
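The "evaluate every drafter for free" observation has a clean single-token form: the target distribution p computed during verification already determines each drafter q's token acceptance probability under standard speculative sampling, namely sum_x min(p(x), q(x)). A hedged sketch; the full-information online learner wrapped around it is an assumption.

```python
def acceptance_prob(p, q):
    """Token acceptance probability of drafter q against target p under
    standard speculative sampling: sum_x min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def score_all_drafters(target_probs, drafter_prob_fns, context):
    """The target distribution is already computed during verification,
    so scoring every drafter needs only (cheap) drafter forward passes --
    no additional target-model queries are incurred."""
    return [acceptance_prob(target_probs, q_fn(context))
            for q_fn in drafter_prob_fns]

# An online learner (e.g. exponential weights over these full-information
# scores) can then track the best drafter per query, which is the lever
# that improves on bandit-style selection as the number of drafters grows.
```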