eagle-2
ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding
Kang, Jialiang, Shu, Han, Li, Wenshuo, Zhai, Yingjie, Chen, Xinghao
Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups (<1.5x). This gap is increasingly significant as multimodal capabilities become central to large-scale models. We hypothesize that large VLMs can effectively filter redundant image information layer by layer without compromising textual comprehension, whereas smaller draft models struggle to do so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a novel framework tailored for VLMs. ViSpec employs a lightweight vision adaptor module to compress image tokens into a compact representation, which is seamlessly integrated into the draft model's attention mechanism while preserving original image positional information. Additionally, we extract a global feature vector for each input image and augment all subsequent text tokens with this feature to enhance multimodal coherence. To overcome the scarcity of multimodal datasets with long assistant responses, we curate a specialized training dataset by repurposing existing datasets and generating extended outputs using the target VLM with modified prompts. Our training strategy mitigates the risk of the draft model exploiting direct access to the target model's hidden states, which could otherwise lead to shortcut learning when training solely on target model outputs. Extensive experiments validate ViSpec, achieving, to our knowledge, the first substantial speedup in VLM speculative decoding. Code is available at https://github.com/KangJialiang/ViSpec.
POSS: Position Specialist Generates Better Draft for Speculative Decoding
Huang, Langlin, Huang, Chengsong, Leng, Jixuan, Huang, Di, Huang, Jiaxin
Speculative decoding accelerates Large Language Model (LLM) inference by using a small draft model to predict multiple tokens, and a large target model to verify these tokens in parallel. Recent studies leverage the hidden state of the target model to enhance draft model prediction accuracy. However, existing methods suffer from the degrading quality of draft token predictions at later positions, due to error accumulation in draft model generated features. In this paper, we propose Position Specialists (PosS), which consist of multiple position-specialized draft layers to generate tokens at assigned position(s). Position specialists greatly improve token acceptance rate at later positions per drafting round, as each specialist only needs to focus on handling a certain level of draft model feature deviation. Experiment results on Llama-3-8B-Instruct and Llama-2-13B-chat across six datasets demonstrate that PosS effectively improves over baselines on average acceptance length and speed-up ratio. Our codebase is available at https://github.com/shrango/PosS.
Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design
Zhang, Yudi, Zhao, Weilin, Han, Xu, Zhao, Tiejun, Xu, Wang, Cao, Hailong, Zhu, Conghui
Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Speculative decoding mitigates the memory bandwidth bottleneck by verifying multiple tokens within a single forward pass, which increases computational effort. Quantization achieves this optimization by compressing weights and activations into lower bit-widths and also reduces computations via low-bit matrix multiplications. To further leverage their strengths, we investigate the integration of these two techniques. Surprisingly, experiments applying the advanced speculative decoding method EAGLE-2 to various quantized models reveal that the memory benefits from 4-bit weight quantization are diminished by the computational load from speculative decoding. Specifically, verifying a tree-style draft incurs significantly more time overhead than a single-token forward pass on 4-bit weight quantized models. This finding led to our new speculative decoding design: a hierarchical framework that employs a small model as an intermediate stage to turn tree-style drafts into sequence drafts, leveraging the memory access benefits of the target quantized model. Experimental results show that our hierarchical approach achieves a 2.78$\times$ speedup across various tasks for the 4-bit weight Llama-3-70B model on an A100 GPU, outperforming EAGLE-2 by 1.31$\times$. Code available at https://github.com/AI9Stars/SpecMQuant.
- North America > United States (0.04)
- Asia > China > Heilongjiang Province > Harbin (0.04)
- Asia > China > Beijing > Beijing (0.04)
FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling
Zhao, Weilin, Pan, Tengyu, Han, Xu, Zhang, Yudi, Sun, Ao, Huang, Yuxiang, Zhang, Kaihuo, Zhao, Weilun, Li, Yuxuan, Wang, Jianyong, Liu, Zhiyuan, Sun, Maosong
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12$\times$ speedup over the state-of-the-art speculative sampling method EAGLE-2. Code available at https://github.com/thunlp/FR-Spec.
- North America > United States (0.29)
- Asia > China > Beijing > Beijing (0.04)
- Asia > China > Heilongjiang Province > Harbin (0.04)
EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
Li, Yuhui, Wei, Fangyun, Zhang, Chao, Zhang, Hongyang
The sequential nature of modern LLMs makes them expensive and slow, and speculative sampling has proven to be an effective solution to this problem. Methods like EAGLE perform autoregression at the feature level, reusing top-layer features from the target model to achieve better results than vanilla speculative sampling. A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs. However, we observe that scaling up data provides limited improvements for EAGLE. We identify that this limitation arises from EAGLE's feature prediction constraints. In this paper, we introduce EAGLE-3, which abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via a technique named training-time test. These improvements significantly enhance performance and enable the draft model to fully benefit from scaling up training data. Our experiments include both chat models and reasoning models, evaluated on five tasks. The results show that EAGLE-3 achieves a speedup ratio up to 6.5x, with about 1.4x improvement over EAGLE-2. The code is available at https://github.com/SafeAILab/EAGLE.
C2T: A Classifier-Based Tree Construction Method in Speculative Decoding
Huo, Feiye, Tan, Jianchao, Zhang, Kefeng, Cai, Xunliang, Sun, Shengli
The growing scale of Large Language Models (LLMs) has exacerbated inference latency and computational costs. Speculative decoding methods, which aim to mitigate these issues, often face inefficiencies in the construction of token trees and the verification of candidate tokens. Existing strategies, including chain mode, static tree, and dynamic tree approaches, have limitations in accurately preparing candidate token trees for verification. We propose a novel method named C2T that adopts a lightweight classifier to generate and prune token trees dynamically. Our classifier considers additional feature variables beyond the commonly used joint probability to predict the confidence score for each draft token to determine whether it is the candidate token for verification. This method outperforms state-of-the-art (SOTA) methods such as EAGLE-2 on multiple benchmarks, by reducing the total number of candidate tokens by 25% while maintaining or even improving the acceptance length.
- Research Report > New Finding (0.46)
- Research Report > Promising Solution (0.34)
SAM Decoding: Speculative Decoding via Suffix Automaton
Hu, Yuxuan, Wang, Ke, Zhang, Xiaokang, Zhang, Fanjin, Li, Cuiping, Chen, Hong, Zhang, Jing
Speculative decoding (SD) has been demonstrated as an effective technique for lossless LLM inference acceleration. Retrieval-based SD methods, one kind of model-free method, have yielded promising speedup, but they often rely on incomplete retrieval resources, inefficient retrieval methods, and are constrained to certain domains. This paper presents a novel retrieval-based speculative decoding method that adapts suffix automaton (SAM) for efficient and accurate draft generation by utilizing common text corpus and dynamic text sequence. Unlike existing $n$-gram matching methods, SAM-Decoding finds the exact longest suffix match, achieving an average time complexity of O(1) per generation step of SAM update and suffix retrieval. It can also integrate with existing methods, adaptively selecting a draft generation strategy based on match length to generalize to broader domains. Extensive experiments on Spec-Bench show that our method is $18\%+$ faster than other retrieval-based SD methods. Additionally, when combined with advanced EAGLE-2, it provides an additional speedup of $3.28\%$ -- $11.13\%$ across various-sized LLM backbones. Our code is available at our \href{https://github.com/hyx1999/SAM-Decoding}{repository}.
- North America > United States (0.14)
- Asia > China > Beijing > Beijing (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- (2 more...)
Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding
Zhang, Ziyin, Xu, Jiahao, Liang, Tian, Chen, Xingyu, He, Zhiwei, Wang, Rui, Tu, Zhaopeng
Speculative Decoding (SD) has become an important technique in accelerating the inference speed of large language models. Conventional SD methods employ a fixed draft length, which ignores the token generation difficulty across tasks. Consequently, in this paper, we address such an issue and introduce SVIP - a difficulty-aware dynamic draft length policy for speculative decoding systems. Based on a theoretical lower bound of draft token acceptance rate and its inference-time approximation, SVIP adaptively determines the lengths of draft sequences based on the entropy of each draft token distribution. Experimental results on mainstream SD benchmarks and frameworks demonstrate the superior performance of SVIP, achieving up to 20\% walltime speedup on SpecBench over baseline SD methods and 60\% speedup on MT-Bench for long-form generation of up to 8K tokens. Moreover, SVIP is totally training-free and compatible with any existing SD methods that generate draft tokens autoregressively. Experimental results also show that SVIP yields consistent walltime improvement on top of GliDe & CaPE and EAGLE-2.
- Europe > Austria > Vienna (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- (5 more...)
LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding
Jang, Doohyuk, Park, Sihwan, Yang, June Yong, Jung, Yeonsung, Yun, Jihun, Kundu, Souvik, Kim, Sung-Yub, Yang, Eunho
Auto-Regressive (AR) models have recently gained prominence in image generation, often matching or even surpassing the performance of diffusion models. However, one major limitation of AR models is their sequential nature, which processes tokens one at a time, slowing down generation compared to models like GANs or diffusion-based methods that operate more efficiently. While speculative decoding has proven effective for accelerating LLMs by generating multiple tokens in a single forward, its application in visual AR models remains largely unexplored. In this work, we identify a challenge in this setting, which we term \textit{token selection ambiguity}, wherein visual AR models frequently assign uniformly low probabilities to tokens, hampering the performance of speculative decoding. To overcome this challenge, we propose a relaxed acceptance condition referred to as LANTERN that leverages the interchangeability of tokens in latent space. This relaxation restores the effectiveness of speculative decoding in visual AR models by enabling more flexible use of candidate tokens that would otherwise be prematurely rejected. Furthermore, by incorporating a total variation distance bound, we ensure that these speed gains are achieved without significantly compromising image quality or semantic coherence. Experimental results demonstrate the efficacy of our method in providing a substantial speed-up over speculative decoding. In specific, compared to a na\"ive application of the state-of-the-art speculative decoding, LANTERN increases speed-ups by $\mathbf{1.75}\times$ and $\mathbf{1.76}\times$, as compared to greedy decoding and random sampling, respectively, when applied to LlamaGen, a contemporary visual AR model.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States (0.04)
- North America > Dominican Republic (0.04)
Dynamic Depth Decoding: Faster Speculative Decoding for LLMs
Brown, Oscar, Wang, Zhengjie, Do, Andrea, Mathew, Nikhil, Yu, Cheng
The acceleration of Large Language Models (LLMs) with speculative decoding provides a significant runtime improvement without any loss of accuracy. Currently, EAGLE-2 is the state-of-the-art speculative decoding method, improving on EAGLE with a dynamic draft tree. We introduce Dynamic Depth Decoding (DDD), which optimises EAGLE-2's tree drafting method using a dynamic depth. This extends the average speedup that EAGLE-2 achieves over EAGLE by $44\%$, giving DDD an average speedup of $3.16$x.