AccLLM: Accelerating Long-Context LLM Inference Via Algorithm-Hardware Co-Design
Liang, Yanbiao, Shi, Huihong, Shao, Haikuo, Wang, Zhongfeng
arXiv.org Artificial Intelligence
Recently, large language models (LLMs) have achieved huge success in the natural language processing (NLP) field, driving a growing demand to extend their deployment from the cloud to edge devices. However, deploying LLMs on resource-constrained edge devices poses significant challenges, including (1) intensive computation and huge model sizes, (2) great memory and bandwidth demands introduced by the autoregressive generation process, and (3) limited scalability for handling long sequences. To address these challenges, we propose AccLLM, a comprehensive acceleration framework that enables efficient and fast long-context LLM inference through algorithm-hardware co-design. At the algorithmic level, we integrate (1) pruning, (2) Λ-shaped attention, and (3) an innovative W2A8KV4 (2-bit weights, 8-bit activations, and 4-bit KV cache) quantization scheme, thus effectively reducing memory and bandwidth requirements while facilitating LLMs' long-sequence generation. At the hardware level, we design a dedicated FPGA-based accelerator with a reconfigurable computing engine to effectively and flexibly accommodate the diverse operations arising from our compression algorithm, thereby fully translating the algorithmic innovations into tangible hardware efficiency.

Large language models (LLMs) [1]-[4] have revolutionized natural language processing (NLP) with their outstanding capabilities, enabling a wide range of applications [5], including code generation [6], document summarization [7], chatbots [2], and question answering [8]. This impressive potential has driven growing interest in extending LLM deployment beyond traditional cloud-based platforms to edge devices, such as smart vehicles, robots, and embedded systems [9]-[11]. However, mainstream work has focused mainly on optimizing and accelerating LLMs on GPUs [12], [13], which offer abundant compute and memory resources, making those approaches unsuitable for resource-constrained edge scenarios [14], [15].
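Λ-shaped attention (as popularized by attention-sink approaches such as StreamingLLM) restricts each query to the first few "sink" tokens plus a sliding window of recent tokens, so the KV cache stays bounded regardless of sequence length. The sketch below builds such a mask; the function and parameter names (`n_sink`, `window`) are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np

def lambda_mask(seq_len: int, n_sink: int, window: int) -> np.ndarray:
    """Boolean causal mask for Λ-shaped attention (True = attended).

    Each query position attends to (a) the first `n_sink` tokens, which
    act as global attention sinks, and (b) the most recent `window`
    tokens, including itself. Everything in between is pruned, giving
    the attention pattern its Λ shape.
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for q in range(seq_len):
        mask[q, : min(n_sink, q + 1)] = True              # sink tokens
        mask[q, max(0, q - window + 1) : q + 1] = True    # recent window
    return mask

m = lambda_mask(seq_len=8, n_sink=2, window=3)
```

With this pattern, the KV cache for generation only needs to hold `n_sink + window` entries per layer, which is what makes long-sequence decoding tractable on bandwidth-limited edge hardware.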
This work was supported by the National Key R&D Program of China under Grant 2022YFB4400600. Zhongfeng Wang is with the School of Electronic Science and Engineering, Nanjing University, and the School of Integrated Circuits, Sun Yat-sen University (email: zfwang@nju.edu.cn). Correspondence should be addressed to Zhongfeng Wang.
May-8-2025