CARD: A Cache-Assisted Parallel Speculative Decoding Framework via Query-and-Correct Paradigm for Accelerating LLM Inference
Enyu Zhou, Kai Sheng, Hao Chen, Xin He
Speculative decoding (SD), where a draft model provides multiple candidate tokens for the target model to verify in parallel, has demonstrated significant potential for accelerating LLM inference. Yet existing SD approaches adhere to a strict "draft-then-verify" paradigm, enforcing a sequential process that hampers performance and constrains the draft model's capacity. Moreover, rejecting a token in the candidate sequence invalidates all subsequent tokens, leading to wasted computation during drafting. To overcome these limitations, we propose a cache-assisted parallel speculative decoding framework called CARD, which employs a novel "query-and-correct" paradigm. Our approach decouples drafting from verification: the draft model populates a shared cache with candidate tokens, while the target model concurrently refines the draft's trajectory. This enables inference at near-draft speed, effectively leveraging the draft model's efficiency without additional fine-tuning. Experimental results show that CARD significantly outperforms existing state-of-the-art methods, achieving up to a 4.83× acceleration over vanilla autoregressive decoding, with no fine-tuning required for either model.
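The abstract only sketches the mechanism, so the toy Python below is one way the "query-and-correct" loop could be organized, not CARD's actual implementation. All names here (`DraftCache`, `draft_next`, `target_verify`, `oracle`) are hypothetical stand-ins: the two "models" are deterministic toy functions, and a draft thread fills a shared cache while the main thread queries it and corrects the trajectory.

```python
import threading
import time

class DraftCache:
    """Shared, position-indexed cache of candidate tokens (hypothetical name)."""
    def __init__(self):
        self._tokens = {}
        self._lock = threading.Lock()

    def put(self, pos, token):
        with self._lock:
            self._tokens[pos] = token

    def query(self, pos):
        with self._lock:
            return self._tokens.get(pos)

    def invalidate_from(self, pos):
        # A correction by the target invalidates every later drafted token.
        with self._lock:
            for p in [p for p in self._tokens if p >= pos]:
                del self._tokens[p]

# Toy stand-ins for the two LLMs: each maps a token prefix to a next token.
def oracle(prefix):
    return (sum(prefix) * 7 + 1) % 10

def draft_next(prefix):
    time.sleep(0.004)                       # fast draft model
    tok = oracle(prefix)
    return (tok + 1) % 10 if len(prefix) % 7 == 6 else tok   # occasional error

def target_verify(prefix, window):
    """Simulate one batched target forward pass over the whole window:
    the cost is a single step regardless of window size."""
    time.sleep(0.02)                        # slow but parallel target model
    ctx = list(prefix)
    n_ok = 0
    for tok in window:
        if tok != oracle(ctx):
            break
        ctx.append(tok)
        n_ok += 1
    return n_ok, oracle(ctx)                # accepted count + corrected token

def generate(prompt, num_tokens):
    cache = DraftCache()
    committed = list(prompt)                # tokens fixed by the target
    done = threading.Event()

    def draft_loop():
        # Drafting never blocks on verification: it keeps running ahead,
        # resyncing whenever the target has corrected the trajectory.
        speculative = []
        while not done.is_set():
            snap = list(committed)
            if speculative[:len(snap)] != snap:
                speculative = snap          # resync after a correction
            if len(speculative) - len(snap) >= 8:
                time.sleep(0.001)           # cache is far enough ahead
                continue
            tok = draft_next(speculative)
            cache.put(len(speculative), tok)
            speculative.append(tok)

    threading.Thread(target=draft_loop, daemon=True).start()

    while len(committed) - len(prompt) < num_tokens:
        pos = len(committed)
        window = []                         # query: pull cached candidates
        while len(window) < 8:
            tok = cache.query(pos + len(window))
            if tok is None:
                break
            window.append(tok)
        n_ok, next_tok = target_verify(committed, window)
        committed.extend(window[:n_ok] + [next_tok])   # correct the trajectory
        cache.invalidate_from(len(committed))
    done.set()
    return committed[len(prompt):][:num_tokens]

if __name__ == "__main__":
    print(generate(prompt=[1, 2, 3], num_tokens=24))
```

Two choices in this sketch mirror the abstract's claims: the target's verification pass costs one step regardless of how many cached candidates it checks (parallel verification), and the draft thread never waits on the target, so drafting and verification are fully decoupled rather than alternating as in draft-then-verify.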
arXiv.org Artificial Intelligence
Sep-22-2025
- Country:
  - Asia > China > Guangdong Province > Guangzhou (0.04)
  - Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
  - North America > United States (0.04)
- Genre:
- Research Report (1.00)
- Technology: