UniAttn: Reducing Inference Costs via Softmax Unification for Post-Training LLMs

Xiong, Yizhe, Huang, Wei, Ye, Xin, Chen, Hui, Lin, Zijia, Lian, Haoran, Su, Zhenpeng, Han, Jungong, Ding, Guiguang

Feb-1-2025–arXiv.org Artificial Intelligence

Post-training is essential for adapting Large Language Models (LLMs) to real-world applications. Deploying post-trained models faces significant challenges due to substantial memory overhead and noticeable inference latency. Existing work has identified significant redundancies in LLMs and proposed efficient architectures, namely intra-layer KV sharing and cross-layer KV sharing. However, intra-layer KV sharing still results in high inference costs, while cross-layer KV sharing leads to significant performance degradation. As a result, both methods remain suboptimal for post-training pre-trained LLMs. In this paper, we identify that the Softmax operation is a primary bottleneck for LLM inference and discover that it is actually highly redundant during post-training. We propose Softmax Unification in Attention (UniAttn), a novel post-training method that unifies Softmax activations across transformer blocks to reduce LLM inference costs. Additionally, UniAttn adopts a linear projection to compensate for the errors induced by Softmax unification. Experiments show that UniAttn matches the performance of standard post-training while significantly reducing inference costs, outperforming existing efficient architectures during post-training. Our code will be available at https://github.com/Bostoncake/UniAttn.
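
To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single attention head, causal masking, and that "unification" means a later block reuses the Softmax activations computed by an earlier block instead of computing its own. The placement of the compensation projection (applied here to the block input and added to the attention output) and all names (`attention_weights`, `unified_layer`, `w_c`) are hypothetical choices made for illustration; the abstract does not specify them.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_weights(x, w_q, w_k):
    """Standard causal softmax attention weights for one head."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    d = len(w_q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    n = len(x)
    # Position i may only attend to positions j <= i.
    masked = [
        [scores[i][j] / math.sqrt(d) if j <= i else -math.inf for j in range(n)]
        for i in range(n)
    ]
    return softmax_rows(masked)


def unified_layer(x, shared_weights, w_v, w_c):
    """Block that reuses `shared_weights` rather than computing its own softmax.

    No query/key projections or softmax are evaluated here. `w_c` is a
    hypothetical linear compensation projection (an assumption of this sketch).
    """
    v = matmul(x, w_v)
    attn_out = matmul(shared_weights, v)
    comp = matmul(x, w_c)
    return [[a + c for a, c in zip(ra, rc)] for ra, rc in zip(attn_out, comp)]


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
n, d = 4, 3                      # sequence length, hidden size (toy values)
x = rand_matrix(n, d)

# Earlier block: computes the softmax activations once.
weights = attention_weights(x, rand_matrix(d, d), rand_matrix(d, d))

# Later block: reuses them; compensation is zero-initialized in this toy setup.
w_v = rand_matrix(d, d)
w_c = [[0.0] * d for _ in range(d)]
out = unified_layer(x, weights, w_v, w_c)
```

In this sketch the reusing block skips the query and key projections, the score matrix, and the Softmax itself, which is where any savings would come from. In a post-training setting `w_c` would be trained along with the rest of the model rather than left at zero.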

large language model, machine learning, natural language

Genre:
- Research Report > New Finding (0.67)

Industry:
- Health & Medicine (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.70)
