Energy-Efficient Wireless LLM Inference via Uncertainty and Importance-Aware Speculative Decoding
Park, Jihoon, Oh, Seungeun, Kim, Seong-Lyun
arXiv.org Artificial Intelligence
We propose a novel uncertainty- and importance-aware speculative decoding framework that opportunistically skips LLM verification based on local token statistics. To mitigate attention collapse, we design an adaptive importance threshold that adjusts dynamically based on the distribution of attention weights at each decoding step. We provide extensive evaluations showing that our framework significantly reduces LLM usage, bandwidth, and energy costs while maintaining or exceeding the accuracy of prior methods. We show that our framework is tunable: the strictness of the upload condition can be adjusted to achieve desired trade-offs across accuracy, latency, and energy efficiency.

The remainder of this paper is organized as follows. Section II introduces the system and wireless communication model. Section III presents the proposed opportunistic skipping mechanism based on token uncertainty and importance. Section IV evaluates the performance of our method in terms of accuracy, latency, token throughput, and energy efficiency. Section V concludes with key findings and potential future directions.
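The skipping rule described in the abstract can be illustrated with a minimal sketch. The excerpt does not give the paper's formulas, so everything concrete here is an assumption made for illustration: uncertainty is taken to be the Shannon entropy of the draft model's next-token distribution, the adaptive importance threshold is taken to be the mean plus `k` standard deviations of the current step's attention weights, and a token is uploaded for LLM verification only when it is both uncertain and important. The function names, `entropy_threshold`, and `k` are hypothetical.

```python
import math
from statistics import mean, pstdev


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_importance_threshold(attn_weights, k=1.0):
    """Assumed rule: mean + k * std of this decoding step's attention weights.

    Recomputing the threshold per step lets it follow the attention
    distribution instead of relying on one fixed cutoff.
    """
    return mean(attn_weights) + k * pstdev(attn_weights)


def should_upload(probs, importance, attn_weights, entropy_threshold=1.0, k=1.0):
    """Return True if the drafted token should be sent for LLM verification.

    Assumed condition: the draft distribution is uncertain AND the token's
    importance score exceeds the adaptive threshold. Otherwise verification
    is skipped and the locally drafted token is kept.
    """
    uncertain = token_entropy(probs) > entropy_threshold
    important = importance > adaptive_importance_threshold(attn_weights, k)
    return uncertain and important


if __name__ == "__main__":
    attn = [0.1, 0.1, 0.1, 0.7]
    flat = [0.25, 0.25, 0.25, 0.25]      # uncertain draft distribution
    peaked = [0.97, 0.01, 0.01, 0.01]    # confident draft distribution
    print(should_upload(flat, 0.7, attn))    # uncertain and important
    print(should_upload(flat, 0.1, attn))    # uncertain but unimportant
    print(should_upload(peaked, 0.7, attn))  # confident, so verification is skipped
```

In this sketch, raising `entropy_threshold` or `k` makes the upload condition stricter, which is one way the accuracy, latency, and energy trade-off mentioned above could be exposed as a tunable knob.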
Aug-19-2025