CARD: A Cache-Assisted Parallel Speculative Decoding Framework via Query-and-Correct Paradigm for Accelerating LLM Inference
Enyu Zhou, Kai Sheng, Hao Chen, Xin He
Speculative decoding (SD), where a draft model provides multiple candidate tokens for the target model to verify in parallel, has demonstrated significant potential for accelerating LLM inference. Yet existing SD approaches adhere to a strict "draft-then-verify" paradigm, enforcing a sequential process that hampers performance and constrains the draft model's capacity. Moreover, rejecting a token in the candidate sequence invalidates all subsequent tokens, leading to wasted computation during drafting. To overcome these limitations, we propose a cache-assisted parallel speculative decoding framework called CARD, which employs a novel "query-and-correct" paradigm. Our approach decouples drafting from verification: the draft model populates a shared cache with candidate tokens, while the target model concurrently refines the draft's trajectory. This enables inference at near-draft speed, effectively leveraging the draft model's efficiency without additional fine-tuning. Experimental results show that CARD significantly outperforms existing state-of-the-art methods, achieving up to a 4.83× acceleration over vanilla autoregressive decoding, with no fine-tuning required for either model.
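The abstract only sketches the mechanism, so the toy Python below is one way the "query-and-correct" loop could be organized, not CARD's actual implementation. All names here (`DraftCache`, `draft_next`, `target_verify`, `oracle`) are hypothetical stand-ins: the two "models" are deterministic toy functions, and a draft thread fills a shared cache while the main thread queries it and corrects the trajectory.

```python
import threading
import time

class DraftCache:
    """Shared, position-indexed cache of candidate tokens (hypothetical name)."""
    def __init__(self):
        self._tokens = {}
        self._lock = threading.Lock()

    def put(self, pos, token):
        with self._lock:
            self._tokens[pos] = token

    def query(self, pos):
        with self._lock:
            return self._tokens.get(pos)

    def invalidate_from(self, pos):
        # A correction by the target invalidates every later drafted token.
        with self._lock:
            for p in [p for p in self._tokens if p >= pos]:
                del self._tokens[p]

# Toy stand-ins for the two LLMs: each maps a token prefix to a next token.
def oracle(prefix):
    return (sum(prefix) * 7 + 1) % 10

def draft_next(prefix):
    time.sleep(0.004)                       # fast draft model
    tok = oracle(prefix)
    return (tok + 1) % 10 if len(prefix) % 7 == 6 else tok   # occasional error

def target_verify(prefix, window):
    """Simulate one batched target forward pass over the whole window:
    the cost is a single step regardless of window size."""
    time.sleep(0.02)                        # slow but parallel target model
    ctx = list(prefix)
    n_ok = 0
    for tok in window:
        if tok != oracle(ctx):
            break
        ctx.append(tok)
        n_ok += 1
    return n_ok, oracle(ctx)                # accepted count + corrected token

def generate(prompt, num_tokens):
    cache = DraftCache()
    committed = list(prompt)                # tokens fixed by the target
    done = threading.Event()

    def draft_loop():
        # Drafting never blocks on verification: it keeps running ahead,
        # resyncing whenever the target has corrected the trajectory.
        speculative = []
        while not done.is_set():
            snap = list(committed)
            if speculative[:len(snap)] != snap:
                speculative = snap          # resync after a correction
            if len(speculative) - len(snap) >= 8:
                time.sleep(0.001)           # cache is far enough ahead
                continue
            tok = draft_next(speculative)
            cache.put(len(speculative), tok)
            speculative.append(tok)

    threading.Thread(target=draft_loop, daemon=True).start()

    while len(committed) - len(prompt) < num_tokens:
        pos = len(committed)
        window = []                         # query: pull cached candidates
        while len(window) < 8:
            tok = cache.query(pos + len(window))
            if tok is None:
                break
            window.append(tok)
        n_ok, next_tok = target_verify(committed, window)
        committed.extend(window[:n_ok] + [next_tok])   # correct the trajectory
        cache.invalidate_from(len(committed))
    done.set()
    return committed[len(prompt):][:num_tokens]

if __name__ == "__main__":
    print(generate(prompt=[1, 2, 3], num_tokens=24))
```

Two choices in this sketch mirror the abstract's claims: the target's verification pass costs one step regardless of how many cached candidates it checks (parallel verification), and the draft thread never waits on the target, so drafting and verification are fully decoupled rather than alternating as in draft-then-verify.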
arXiv.org Artificial Intelligence
Sep-22-2025
- Country:
  - Asia > China > Guangdong Province > Guangzhou (0.04)
  - Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
  - North America > United States (0.04)
- Genre:
- Research Report (1.00)
- Technology: