SHARP: Accelerating Language Model Inference by SHaring Adjacent layers with Recovery Parameters
Yiping Wang, Hanxian Huang, Yifang Chen, Jishen Zhao, Simon Shaolei Du, Yuandong Tian
arXiv.org Artificial Intelligence
While large language models (LLMs) have advanced natural language processing tasks, their growing computational and memory demands make deployment on resource-constrained devices such as mobile phones increasingly challenging. In this paper, we propose SHARP (SHaring Adjacent layers with Recovery Parameters), a novel approach that accelerates LLM inference by sharing parameters across adjacent layers, thereby reducing memory-loading overhead, while introducing low-rank recovery parameters to maintain performance. Inspired by the observation that consecutive layers produce similar outputs, SHARP employs a two-stage recovery process: Single Layer Warmup (SLW) followed by Supervised Fine-Tuning (SFT). Extensive experiments demonstrate that SHARP can recover the model's perplexity on various in-distribution tasks using no more than 50k fine-tuning examples while reducing the number of stored MLP parameters by 38% to 65%. We also conduct several ablation studies of SHARP, showing that replacing layers towards the later parts of the model yields better performance retention, and that different recovery parameterizations perform similarly when parameter counts are matched. Furthermore, SHARP saves 42.8% of model storage and reduces total inference time by 42.2% compared to the original Llama2-7b model on mobile devices. Our results highlight SHARP as an efficient solution for reducing inference costs when deploying LLMs without the need for pretraining-scale resources.

Deploying a pre-trained large language model requires significant computational and memory resources (Aminabadi et al., 2022; Pope et al., 2023; Kim et al., 2023b; Zhang et al., 2024b), which can further restrict inference speed. For instance, a 70-billion-parameter language model stored in FP16 precision requires approximately 148GB of memory to hold the model weights, necessitating two A100 GPUs with 80GB of memory each to load the entire model. During inference, the input sequence and the KV cache must also be stored on the GPU, incurring additional memory usage. These concerns are especially significant for deployment on mobile devices, which typically have smaller DRAM (e.g., around 6GB in the iPhone 15) and higher communication overhead (Liu et al., 2024). Prior layer-sharing approaches repeat each layer twice and train the model from scratch. SHARP instead leverages fine-tuning-scale data to train additional recovery parameters, which are far fewer than the original model parameters, in order to recover the model's performance. In this paper, we explore several candidate parameterizations for these recovery parameters, including a LoRA-style transformation.
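As a rough illustration of the sharing mechanism described above (a minimal sketch, not the authors' implementation; the module name, the rank hyperparameter, and the lora_A/lora_B names are assumptions), the following PyTorch snippet shows how one MLP projection in layer i+1 could reuse the weights of layer i while a LoRA-style low-rank correction is trained to recover the replaced layer's behavior:

    import torch
    import torch.nn as nn

    class SharedMLPProjection(nn.Module):
        """Projection for layer i+1 that reuses layer i's weights plus a low-rank delta."""

        def __init__(self, shared_linear: nn.Linear, rank: int = 16):
            super().__init__()
            self.shared = shared_linear  # weights borrowed from the adjacent layer
            for p in self.shared.parameters():
                p.requires_grad = False  # shared weights stay frozen; only recovery params train
            d_out, d_in = shared_linear.weight.shape
            # LoRA-style recovery parameters: the only new weights stored for this layer
            self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.02)
            self.lora_B = nn.Parameter(torch.zeros(d_out, rank))  # zero init, so the delta starts at 0

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # (W_shared + B A) x : shared projection plus low-rank correction
            return self.shared(x) + (x @ self.lora_A.T) @ self.lora_B.T

    # Hypothetical usage: layer 11 reuses layer 10's up-projection (Llama2-7b-like sizes).
    layer10_up = nn.Linear(4096, 11008, bias=False)   # original weights, stored once
    layer11_up = SharedMLPProjection(layer10_up, rank=16)
    out = layer11_up(torch.randn(2, 4096))            # shape (2, 11008)

In such a sketch, the Single Layer Warmup stage would train only the low-rank recovery parameters so that the shared layer approximates the replaced layer's outputs, and the subsequent supervised fine-tuning stage would further recover end-to-end task performance.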
Feb-10-2025
- Genre:
- Research Report > New Finding (0.87)
- Industry:
- Education (1.00)
- Technology: