FastEagle: Cascaded Drafting for Accelerating Speculative Decoding
Huang, Haiduo, Song, Jiangcheng, Zhao, Wenzhe, Ren, Pengju
arXiv.org Artificial Intelligence
Speculative decoding accelerates generation by drafting candidates and verifying them in parallel, yet state-of-the-art drafters (e.g., EAGLE) still require N sequential passes to propose N tokens. We present FastEagle, a non-autoregressive cascaded drafter that emits an entire draft in a single forward pass. FastEagle replaces temporal steps with a lightweight layer cascade and trains with layer-wise supervision to mitigate error accumulation. Coupled with a constrained draft tree that preserves lossless verification cost, FastEagle delivers substantial wall-clock speedups over strong autoregressive drafters while maintaining competitive acceptance behavior. Across multiple LLMs (Vicuna-13B, LLaMA-Instruct 3.x, and DeepSeek-R1-Distill-LLaMA) and tasks (MT-Bench, HumanEval, GSM8K, CNN/DM, Alpaca), FastEagle consistently outperforms EAGLE-3 in speedup under both greedy and stochastic decoding, with comparable average acceptance lengths. These results indicate that removing sequential dependencies in drafting is a practical path toward lossless LLM inference acceleration.
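The contrast the abstract draws between N sequential drafter passes and a single cascaded pass can be illustrated with a toy sketch. This is not the authors' implementation: `toy_step`, `autoregressive_draft`, `cascaded_draft`, and `accepted_length` are hypothetical names, the "model" is a trivial integer rule, and only greedy verification (longest matching prefix) is shown.

```python
from typing import List, Tuple

VOCAB = 10  # toy vocabulary size (assumption, for illustration only)


def toy_step(context: List[int]) -> int:
    """Hypothetical stand-in for one drafter forward pass: predicts the next token."""
    return (context[-1] + 1) % VOCAB


def autoregressive_draft(prefix: List[int], n: int) -> Tuple[List[int], int]:
    """Draft n tokens with n sequential passes, each conditioned on the previous output."""
    draft: List[int] = []
    passes = 0
    for _ in range(n):
        draft.append(toy_step(prefix + draft))
        passes += 1  # one full drafter pass per drafted token
    return draft, passes


def cascaded_draft(prefix: List[int], n: int) -> Tuple[List[int], int]:
    """Draft n tokens in one pass: layer k refines the state of layer k-1 and emits token k."""
    state = prefix[-1]
    draft: List[int] = []
    for _ in range(n):  # iterates over layers inside a single pass, not over passes
        state = (state + 1) % VOCAB
        draft.append(state)
    return draft, 1


def accepted_length(draft: List[int], target: List[int]) -> int:
    """Greedy verification: count draft tokens matching the target model's own choices."""
    count = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        count += 1
    return count


ar_draft, ar_passes = autoregressive_draft([3], 4)
cas_draft, cas_passes = cascaded_draft([3], 4)
matched = accepted_length(cas_draft, [4, 5, 9, 7])
```

In this toy both drafters propose the same tokens, but the cascaded one does so in one pass instead of four. In a real system the cascade layers cannot see earlier sampled tokens, which is why the abstract mentions layer-wise supervision to limit error accumulation.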
Sep-26-2025
- Country:
- Asia > China
- Shaanxi Province > Xi'an (0.40)
- North America > United States (0.06)
- Genre:
- Research Report (0.64)