speculation length
DSDE: Dynamic Speculative Decoding with KLD Stability for Real-World Serving
Yang, Mingyu, Choi, Jae-Young, Moon, Kihyo, Jang, Minsung, Jeon, Eunjoo
Speculative decoding accelerates large language model inference, but its reliance on a fixed speculation length is suboptimal in large-batch serving environments with diverse requests. This paper explores a new direction for dynamic adaptation by investigating a novel class of post-hoc, diagnostic signals. We propose the Dynamic Speculative Decoding Engine (DSDE), a training-free framework built on two primary components: (1) a predictive signal based on the variance of the Kullback-Leibler divergence (KLD), which diagnoses the generation's regional stability, and (2) an adaptive speculation length cap to mitigate the straggler problem in per-sequence decoding. Experiments demonstrate the potential of using KLD-based stability signals for dynamic adaptation. An algorithm guided by these signals achieves end-to-end latency competitive with leading baselines and exhibits superior robustness across diverse workloads. This robustness is particularly valuable in challenging low-acceptance-rate regimes, where the proposed signal maintains its diagnostic utility. Collectively, these findings validate post-hoc signals as a valuable component for building more robust and intelligent LLM inference systems, and highlight a promising direction for future research on dynamic speculation length adaptation.
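A minimal sketch of the core idea, assuming the stability signal is the variance of recent draft/target KL divergences at verified positions; the window size, thresholds, and the variance-to-length mapping are illustrative assumptions, not the algorithm from the paper:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) between two discrete distributions over the vocabulary."""
    p = np.asarray(p, dtype=np.float64) + eps
    q = np.asarray(q, dtype=np.float64) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def choose_speculation_length(recent_kls, k_min=1, k_max=8,
                              var_low=0.05, var_high=0.5):
    """Map the variance of recent per-token KL values to a draft length.

    Low variance -> draft/target disagreement is stable, so speculate longer;
    high variance -> the region is unstable, so fall back to short drafts.
    Window size and thresholds are illustrative, not taken from the paper.
    """
    if len(recent_kls) < 2:
        return k_min
    v = float(np.var(recent_kls))
    if v <= var_low:
        return k_max
    if v >= var_high:
        return k_min
    frac = (var_high - v) / (var_high - var_low)  # 0 near var_high, 1 near var_low
    return int(round(k_min + frac * (k_max - k_min)))
```

In a serving loop, recent_kls would be a per-sequence sliding window, and the returned length would additionally be clamped by a batch-wide cap to limit stragglers, mirroring the second component described above.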
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Belgium > Brussels-Capital Region > Brussels (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
Efficient Speculative Decoding for Llama at Scale: Challenges and Solutions
Tang, Bangsheng, Fu, Carl Chengyan, Kou, Fei, Sizov, Grigory, Zhang, Haoci, Park, Jason, Liu, Jiawen, You, Jie, Yang, Qirui, Mehta, Sachin, Cai, Shengyong, Wang, Xiaodong, Liu, Xingyu, Li, Yunlu, Zhou, Yanjun, Wei, Wei, Zhao, Zhiwei, Qi, Zixi, Victoria, Adolfo, Ibrahim, Aya, Wasti, Bram, Kim, Changkyu, Haziza, Daniel, Sun, Fei, Delfin, Giancarlo, Guo, Emily, Ouyang, Jialin, Lee, Jaewon, Huang, Jianyu, Reizenstein, Jeremy, Fang, Lu, Zhu, Quinn, Verma, Ria, Mihailescu, Vlad, Guo, Xingwen, Cui, Yan, Hu, Ye, Lee, Yejin
Speculative decoding is a standard method for accelerating the inference speed of large language models. However, scaling it for production environments poses several engineering challenges, including efficiently implementing different operations (e.g., tree attention and multi-round speculative decoding) on GPUs. In this paper, we detail the training and inference optimization techniques that we have implemented to enable EAGLE-based speculative decoding at production scale for Llama models. With these changes, we achieve a new state-of-the-art inference latency for Llama models. For example, Llama4 Maverick decodes at a speed of about 4 ms per token (with a batch size of one) on 8 NVIDIA H100 GPUs, which is 10% faster than the previously best-known method. Furthermore, for EAGLE-based speculative decoding, our optimizations enable a 1.4x to 2.0x speed-up for large batch sizes at production scale.
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Asia > Middle East > Jordan (0.04)
- Africa > Ethiopia > Addis Ababa > Addis Ababa (0.04)
DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding
Zarch, Hossein Entezari, Gao, Lei, Jiang, Chaoyi, Annavaram, Murali
Speculative Decoding (SD) is a widely used approach to accelerate the inference of large language models (LLMs) without reducing generation quality. It operates by first using a compact model to draft multiple tokens efficiently, followed by parallel verification using the target LLM. This approach leads to faster inference compared to auto-regressive decoding. While there are multiple approaches to creating a draft model, one promising option is to use early-exit methods. These methods draft candidate tokens by using a subset of layers of the primary model and applying the remaining layers for verification, allowing a single model to handle both drafting and verification. While this technique reduces memory usage and computational cost, its performance relies on the choice of the exit layer for drafting and the number of tokens drafted (speculation length) in each SD round. Prior works use hyperparameter exploration to statically select these values. However, our evaluations show that these hyperparameter values are task-specific, and even within a task they depend on the current sequence context. We introduce DEL (Dynamic Exit Layer), a plug-and-play method that adaptively selects the exit layer and speculation length during inference. DEL dynamically tracks the acceptance rate of tokens drafted at each layer of the LLM and uses that knowledge to heuristically select the optimal exit layer and speculation length. Our experiments across a broad range of models and downstream tasks show that DEL achieves overall speedups of $2.16\times$$\sim$$2.62\times$ over vanilla auto-regressive decoding and improves upon state-of-the-art SD methods, which peak at $2.43\times$, by up to $0.19\times$. The code is available at https://github.com/hoenza/DEL.
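The abstract does not spell out the heuristic, but the general recipe (tracking per-layer acceptance rates and scoring candidate exit-layer/draft-length pairs by a cost-aware expected yield) can be sketched roughly as follows; the EMA update, the cost model, and all constants are assumptions for illustration only:

```python
class ExitLayerSelector:
    """Track per-layer acceptance rates and pick an (exit layer, draft length) pair."""

    def __init__(self, num_layers, k_max=8, ema=0.9):
        self.num_layers = num_layers
        self.k_max = k_max
        self.ema = ema
        # Moderately optimistic prior so every candidate layer gets explored.
        self.accept_rate = {layer: 0.9 for layer in range(1, num_layers)}

    def update(self, exit_layer, accepted, drafted):
        """Exponential-moving-average update after one speculative decoding round."""
        observed = accepted / max(drafted, 1)
        self.accept_rate[exit_layer] = (self.ema * self.accept_rate[exit_layer]
                                        + (1 - self.ema) * observed)

    def select(self):
        """Score (layer, k) pairs by expected accepted tokens per unit of work."""
        best, best_score = (1, 1), -1.0
        for layer, alpha in self.accept_rate.items():
            alpha = min(alpha, 0.999)                 # keep the geometric sum finite
            draft_cost = layer / self.num_layers      # drafting runs a prefix of layers
            for k in range(1, self.k_max + 1):
                expected = (1 - alpha ** (k + 1)) / (1 - alpha)
                score = expected / (k * draft_cost + 1.0)  # +1.0 ~ one verification pass
                if score > best_score:
                    best, best_score = (layer, k), score
        return best
```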
- North America > United States > California (0.14)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
Token-Driven GammaTune: Adaptive Calibration for Enhanced Speculative Decoding
Gautam, Aayush, Shrestha, Susav, Reddy, Narasimha
Speculative decoding accelerates large language model (LLM) inference by using a smaller draft model to propose tokens, which are then verified by a larger target model. However, selecting an optimal speculation length is critical for maximizing speedup while minimizing wasted computation. We introduce GammaTune and GammaTune+, training-free adaptive algorithms that dynamically adjust speculation length based on token acceptance rates using a heuristic-based switching mechanism. Evaluated on SpecBench across multiple tasks and model pairs, our method outperforms other heuristic-based approaches and fixed-length speculative decoding, achieving an average speedup of 15% (±5%) with GammaTune and 16% (±3%) with GammaTune+, while reducing performance variance. This makes GammaTune a robust and efficient solution for real-world deployment.
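The calibration rule itself is not given in the abstract; as a rough illustration of acceptance-rate-driven switching, one might nudge the speculation length between two hysteresis thresholds (the values below are arbitrary, not GammaTune's calibrated settings):

```python
def adjust_speculation_length(gamma, accepted, drafted,
                              gamma_min=1, gamma_max=12,
                              raise_at=0.8, lower_at=0.4):
    """Nudge the speculation length after one verification step.

    If most drafted tokens were accepted, longer drafts are likely profitable;
    if most were rejected, shorter drafts waste less work. Thresholds here are
    illustrative assumptions, not GammaTune's calibrated values.
    """
    rate = accepted / max(drafted, 1)
    if rate >= raise_at and gamma < gamma_max:
        return gamma + 1
    if rate <= lower_at and gamma > gamma_min:
        return gamma - 1
    return gamma
```

After each verification step, calling gamma = adjust_speculation_length(gamma, accepted, drafted) keeps the draft length near the regime where most drafted tokens are accepted.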
- North America > United States > Texas > Brazos County > College Station (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Florida > Miami-Dade County > Miami (0.05)
PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System
He, Yintao, Mao, Haiyu, Giannoula, Christina, Sadrosadati, Mohammad, Gómez-Luna, Juan, Li, Huawei, Li, Xiaowei, Wang, Ying, Mutlu, Onur
Large language models (LLMs) are widely used for natural language understanding and text generation. An LLM relies on a time-consuming step called LLM decoding to generate output tokens. Several prior works focus on improving the performance of LLM decoding using parallelism techniques, such as batching and speculative decoding. State-of-the-art LLM decoding has both compute-bound and memory-bound kernels. Some prior works statically identify and map these different kernels to a heterogeneous architecture consisting of both processing-in-memory (PIM) units and computation-centric accelerators. We observe that the characteristics of LLM decoding kernels (e.g., whether or not a kernel is memory-bound) can change dynamically due to parameter changes made to meet user and/or system demands, making (1) static kernel mapping to PIM units and computation-centric accelerators suboptimal, and (2) a one-size-fits-all approach to designing PIM units inefficient due to the large degree of heterogeneity even among memory-bound kernels. In this paper, we aim to accelerate LLM decoding while considering the dynamically changing characteristics of the kernels involved. We propose PAPI (PArallel Decoding with PIM), a PIM-enabled heterogeneous architecture that exploits dynamic scheduling of compute-bound or memory-bound kernels to suitable hardware units. PAPI has two key mechanisms: (1) online kernel characterization to dynamically schedule kernels to the most suitable hardware units at runtime and (2) a PIM-enabled heterogeneous computing system that harmoniously orchestrates both computation-centric processing units and hybrid PIM units with different computing capabilities. Our experimental results on three broadly-used LLMs show that PAPI achieves 1.8$\times$ and 11.1$\times$ speedups over a state-of-the-art heterogeneous LLM accelerator and a state-of-the-art PIM-only LLM accelerator, respectively.
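At its core, the scheduling decision amounts to classifying each kernel as compute- or memory-bound at runtime; a roofline-style sketch of that classification is shown below, where the profiling callbacks and device numbers are hypothetical placeholders rather than PAPI's actual characterization mechanism:

```python
def is_memory_bound(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Roofline-style test: a kernel whose arithmetic intensity (FLOPs per byte)
    falls below the device ridge point is limited by memory bandwidth."""
    arithmetic_intensity = flops / max(bytes_moved, 1)
    ridge_point = peak_flops / peak_bandwidth
    return arithmetic_intensity < ridge_point

def dispatch(kernel, batch_size):
    """Route a decoding kernel to PIM units or to the compute-centric accelerator.

    estimate_flops/estimate_bytes stand in for online profiling and are
    hypothetical; the device numbers are placeholders, not PAPI's hardware.
    """
    flops = kernel.estimate_flops(batch_size)        # hypothetical interface
    bytes_moved = kernel.estimate_bytes(batch_size)  # hypothetical interface
    if is_memory_bound(flops, bytes_moved,
                       peak_flops=1.0e15, peak_bandwidth=3.0e12):
        return "PIM"
    return "ACCELERATOR"
```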
SSSD: Simply-Scalable Speculative Decoding
Marzollo, Michele, Zhuang, Jiawei, Roemer, Niklas, Müller, Lorenz K., Cavigelli, Lukas
Over the past year, Speculative Decoding has gained popularity as a technique for accelerating Large Language Model inference. While several methods have been introduced, most struggle to deliver satisfactory performance at batch sizes typical for data centers ($\geq 8$) and often involve significant deployment complexities. In this work, we offer a theoretical explanation of how Speculative Decoding can be effectively utilized with larger batch sizes. We also introduce a method that integrates seamlessly into existing systems without additional training or the complexity of deploying a small LLM. In a continuous batching setting, we achieve a 4x increase in throughput without any latency impact for short context generation, and a 1.7-2x improvement in both latency and throughput for longer contexts.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > Santa Clara County > Santa Clara (0.04)
The Synergy of Speculative Decoding and Batching in Serving Large Language Models
Su, Qidong, Giannoula, Christina, Pekhimenko, Gennady
Large Language Models (LLMs) like GPT are state-of-the-art text generation models that provide significant assistance in daily routines. However, LLM execution is inherently sequential, since they produce only one token at a time, thus incurring low hardware utilization on modern GPUs. Batching and speculative decoding are two techniques to improve GPU hardware utilization in LLM inference. To study their synergy, we build a prototype and perform an extensive characterization analysis on various LLM models and GPU architectures. We observe that the optimal speculation length depends on the batch size used. We analyze this key observation and build a quantitative model to explain it. Based on our analysis, we propose a new adaptive speculative decoding strategy that chooses the optimal speculation length for different batch sizes. Our evaluations show that our proposed method can achieve equal or better performance than state-of-the-art speculative decoding schemes with a fixed speculation length.
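The dependence of the optimal speculation length on batch size can be illustrated with a toy cost model: assuming each drafted token is accepted independently with probability $\alpha$, a verification step over $k$ drafted tokens yields $(1-\alpha^{k+1})/(1-\alpha)$ tokens in expectation, while its cost grows with batch_size * (k + 1) once the step becomes compute-bound. The constants below are arbitrary and the model is a sketch, not the paper's quantitative model:

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens produced per verification step when k tokens are drafted and
    each is accepted independently with probability alpha (geometric-series sum)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def pick_speculation_length(alpha, batch_size, k_max=8,
                            base_step_cost=1.0, compute_cost_per_token=0.02):
    """Toy model: verifying batch_size * (k + 1) tokens in parallel is cheap while the
    step stays memory-bound, but its compute cost grows with the total token count.
    Pick the k that maximizes expected tokens per unit of step time."""
    best_k, best_rate = 1, 0.0
    for k in range(1, k_max + 1):
        step_cost = base_step_cost + compute_cost_per_token * batch_size * (k + 1)
        rate = expected_tokens_per_step(alpha, k) / step_cost
        if rate > best_rate:
            best_k, best_rate = k, rate
    return best_k
```

With $\alpha = 0.8$, this toy model picks the longest allowed draft at batch size 1 but only a single drafted token at batch size 64, matching the qualitative observation above.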