Energy-Efficient Wireless LLM Inference via Uncertainty and Importance-Aware Speculative Decoding

Park, Jihoon, Oh, Seungeun, Kim, Seong-Lyun

Aug-19-2025–arXiv.org Artificial Intelligence

We propose a novel uncertainty- and importance-aware speculative decoding framework that opportunistically skips LLM verification based on local token statistics. To mitigate attention collapse, we design an adaptive importance threshold that adjusts dynamically to the distribution of attention weights at each decoding step. Extensive evaluations show that the framework significantly reduces LLM usage, bandwidth, and energy costs while maintaining or exceeding the accuracy of prior methods. The framework is also tunable: the strictness of the upload condition can be adjusted to achieve the desired trade-off among accuracy, latency, and energy efficiency.

The remainder of this paper is organized as follows. Section II introduces the system and wireless communication model. Section III presents the proposed opportunistic skipping mechanism based on token uncertainty and importance. Section IV evaluates the method in terms of accuracy, latency, token throughput, and energy efficiency. Section V concludes with key findings and potential future directions.
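The abstract does not spell out the upload rule, so the following is only a minimal sketch of how an uncertainty- and importance-aware skip decision could look, not the authors' method. It assumes uncertainty is measured as the Shannon entropy of the draft model's next-token distribution, that importance is a scalar attention score for the drafted token, and that the adaptive threshold is the mean plus a multiple of the standard deviation of the current step's attention weights. The function names and the parameters `entropy_threshold` and `strictness` are hypothetical.

```python
import math
from statistics import mean, pstdev


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_importance_threshold(attention_weights, strictness=1.0):
    """Assumed rule: mean + strictness * std of this step's attention weights.

    Recomputed at every decoding step, so the threshold follows the
    current attention distribution instead of staying fixed.
    """
    return mean(attention_weights) + strictness * pstdev(attention_weights)


def should_upload(probs, token_attention, attention_weights,
                  entropy_threshold=1.0, strictness=1.0):
    """Return True if the drafted token should be sent for LLM verification.

    Assumed condition: upload only when the token is both uncertain
    (high entropy) and important (attention score above the adaptive
    threshold); otherwise verification is skipped.
    """
    uncertain = token_entropy(probs) > entropy_threshold
    important = token_attention > adaptive_importance_threshold(
        attention_weights, strictness)
    return uncertain and important


# Illustrative values only (not taken from the paper).
attention = [0.05, 0.05, 0.10, 0.80]
confident = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]

skip_confident = should_upload(confident, 0.80, attention)
upload_uncertain = should_upload(uniform, 0.80, attention)
stricter = should_upload(uniform, 0.80, attention, strictness=2.0)
```

In this sketch, raising `strictness` or `entropy_threshold` makes uploads rarer, which is one way the tunable trade-off among accuracy, latency, and energy described above could be exposed.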

large language model, natural language, throughput, (17 more...)

Genre:
- Research Report (0.50)

Industry:
- Information Technology (0.46)

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
