Accelerating LLM Inference with Staged Speculative Decoding
Recent advances with large language models (LLMs) illustrate their diverse capabilities. We propose a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low arithmetic intensity of small-batch inference by improving upon previous work in speculative decoding. First, we restructure the speculative batch as a tree, which reduces generation costs and increases the expected tokens per batch. Second, we add a second stage of speculative decoding. Taken together, we reduce single-batch decoding latency by 3.16x with a 762M-parameter GPT-2-L model while perfectly preserving output quality.
arXiv.org Artificial Intelligence
Aug-8-2023