Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting

Neural Information Processing Systems 

Speculative decoding has demonstrated its effectiveness in accelerating the inference of large language models (LLMs) while maintaining an identical sampling distribution. However, the conventional approach of training a separate draft model to achieve a satisfactory token acceptance rate can be costly and impractical. In this paper, we propose Kangaroo, a novel self-speculative decoding framework with a double early-exiting strategy, which leverages a shallow sub-network and the LM head of the well-trained target LLM to construct a self-drafting model. The self-verification stage then only requires computing the remaining layers in parallel over the early-exited hidden states. To bridge the representation gap between the sub-network and the full model, we train a lightweight and efficient adapter module on top of the sub-network.
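
To make the architecture concrete, the following is a minimal sketch (not the authors' implementation) of how a self-drafting model could reuse the first few transformer layers and the shared LM head of the target model, with a small trainable adapter bridging the representation gap, and how verification could finish the forward pass from the cached early-exited hidden states. All names (SelfDraftModel, exit_layer, target_model.layers, target_model.lm_head) are hypothetical, and attention masks, position embeddings, and KV caching are omitted for brevity.

import torch
import torch.nn as nn

class SelfDraftModel(nn.Module):
    """Sketch of self-speculative decoding with a shallow sub-network and a shared LM head."""

    def __init__(self, target_model, exit_layer: int, hidden_size: int):
        super().__init__()
        # Reuse the first `exit_layer` blocks of the frozen target model as the draft sub-network.
        self.shallow_layers = target_model.layers[:exit_layer]
        # The remaining blocks are only run during verification.
        self.remaining_layers = target_model.layers[exit_layer:]
        # Lightweight adapter trained to map shallow representations toward full-model ones.
        self.adapter = nn.Sequential(
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, hidden_size),
        )
        # The LM head is shared with the target model; no separate vocabulary projection is trained.
        self.lm_head = target_model.lm_head

    def draft(self, hidden_states: torch.Tensor):
        # Early exit: run only the shallow sub-network, then the adapter and the shared LM head.
        for layer in self.shallow_layers:
            hidden_states = layer(hidden_states)
        draft_logits = self.lm_head(self.adapter(hidden_states))
        # Keep the early-exited hidden states so verification can resume from them.
        return draft_logits, hidden_states

    def verify(self, shallow_hidden_states: torch.Tensor):
        # Self-verification: compute the remaining layers in parallel over all drafted
        # positions, starting from the cached early-exited hidden states.
        hidden_states = shallow_hidden_states
        for layer in self.remaining_layers:
            hidden_states = layer(hidden_states)
        return self.lm_head(hidden_states)

In this sketch only the adapter parameters would be trained; the shallow layers, remaining layers, and LM head stay frozen, which is what keeps the drafting component lightweight relative to training a standalone draft model.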