TurboAttention: Efficient Attention Approximation For High Throughputs LLMs

Hao Kang, Srikant Bharadwaj, James Hensman, Tushar Krishna, Victor Ruhle, Saravan Rajmohan

arXiv.org Artificial Intelligence 

While techniques such as quantization and acceleration algorithms like FlashAttention have improved the efficiency of overall LLM inference, they address different aspects of the problem: quantization focuses on weight-activation operations, while FlashAttention improves execution but requires high-precision formats. Recent key-value (KV) cache quantization reduces memory bandwidth but still needs floating-point dequantization for the attention operation. We present TurboAttention, a comprehensive approach that enables quantized execution of attention and simultaneously addresses both memory and computational efficiency. Our solution introduces two key innovations: FlashQ, a headwise attention quantization technique that enables both compression of the KV cache and quantized execution of activation-activation multiplication, and Sparsity-based Softmax Approximation (SAS), which eliminates the need for dequantization to FP32 during the exponentiation operation in attention. Experimental results demonstrate that TurboAttention achieves 1.2-1.8x [...]

Large language models (LLMs) (Touvron et al., 2023; Gunasekar et al., 2023; Brown et al., 2020) have excelled in tasks like natural language understanding (Joshi et al., 2017; Dodge et al., 2021) and generative text production (Hendrycks et al., 2021; Zhong et al., 2017).

The bottlenecks during LLM inference can be split into three major sections: the linear projection operations (QKV projection and FFN), the memory-intensive Key/Value (KV) [...]
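To make the "memory-intensive KV cache" point concrete, the back-of-the-envelope estimate below uses a Llama-2-7B-like configuration (32 layers, 32 attention heads, head dimension 128); the batch size, context length, and the 4-bit comparison are illustrative assumptions, not figures from the paper.

    def kv_cache_bytes(layers, heads, head_dim, seq_len, batch, bytes_per_elem):
        # Two cached tensors (K and V) per layer, each of shape
        # [batch, heads, seq_len, head_dim].
        return 2 * layers * heads * head_dim * seq_len * batch * bytes_per_elem

    # Illustrative setting: Llama-2-7B-like model, batch of 8, 4k context.
    fp16_cache = kv_cache_bytes(32, 32, 128, 4096, 8, 2)    # 16-bit elements
    int4_cache = kv_cache_bytes(32, 32, 128, 4096, 8, 0.5)  # hypothetical 4-bit cache,
                                                            # ignoring scale/zero-point overhead
    print(f"FP16 KV cache : {fp16_cache / 2**30:.1f} GiB")  # 16.0 GiB
    print(f"4-bit KV cache: {int4_cache / 2**30:.1f} GiB")  # 4.0 GiB

At long contexts this cache grows linearly with sequence length and batch size, which is why compressing it matters independently of weight quantization.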
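The abstract describes FlashQ only at a high level, so the following is a minimal NumPy sketch of what headwise (per-head) quantization of attention activations could look like: each head gets its own scale and zero point, Q and K are stored as integer codes, and the query-key product is carried out on those codes with a single rescale at the end. The helper name quantize_per_head, the 8-bit setting, and the asymmetric scheme are assumptions for illustration, not the paper's actual algorithm.

    import numpy as np

    def quantize_per_head(x, bits=8):
        # Asymmetric per-head quantization: one (scale, zero_point) pair per head.
        # x has shape [heads, seq_len, head_dim].
        qmax = 2 ** bits - 1
        lo = x.min(axis=(1, 2), keepdims=True)
        hi = x.max(axis=(1, 2), keepdims=True)
        scale = (hi - lo) / qmax
        zero = np.round(-lo / scale).astype(np.int32)
        q = np.clip(np.round(x / scale) + zero, 0, qmax).astype(np.int32)
        return q, scale, zero

    heads, seq, dim = 4, 16, 64
    rng = np.random.default_rng(0)
    Q, K = rng.standard_normal((2, heads, seq, dim)).astype(np.float32)

    qQ, sQ, zQ = quantize_per_head(Q)
    qK, sK, zK = quantize_per_head(K)

    # Activation-activation product on integer codes, then one rescale per head:
    #   Q @ K^T  ~=  ((qQ - zQ) @ (qK - zK)^T) * sQ * sK
    scores_int = np.einsum('hqd,hkd->hqk', qQ - zQ, qK - zK)
    scores = scores_int * sQ * sK
    error = np.abs(scores - np.einsum('hqd,hkd->hqk', Q, K)).max()
    print(f"max abs error vs. FP32 scores: {error:.4f}")

Per-head granularity sits between per-tensor scales (coarse, larger error when heads have very different dynamic ranges) and per-channel or per-group scales (finer, more metadata); whether FlashQ layers finer-grained grouping on top of this is not stated in the abstract.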
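Likewise, the abstract states only that SAS removes the FP32 dequantization normally required for exponentiation; the sketch below shows one generic way an attention kernel can avoid a full-precision exp: write exp(x) as an exact power-of-two scale times a small polynomial on the fractional exponent, and drop score entries far below the row maximum, whose contribution to the softmax sum is negligible. The function names, polynomial coefficients, and pruning threshold here are illustrative assumptions, not the paper's method.

    import numpy as np

    LOG2E = 1.4426950408889634  # log2(e)

    def exp2_poly(f):
        # Degree-2 polynomial approximation of 2**f on f in [0, 1).
        # Coefficients are illustrative placeholders, not tuned kernel constants.
        return (0.3371894 * f + 0.657636) * f + 1.00172476

    def fast_exp(x):
        # exp(x) = 2**(x * log2(e)); the integer part of the exponent becomes an
        # exact power-of-two scale (ldexp), the fractional part uses the cheap
        # polynomial above, so no full-precision exp is evaluated.
        t = x * LOG2E
        i = np.floor(t)
        return np.ldexp(exp2_poly(t - i), i.astype(np.int32))

    def approx_softmax(scores, threshold=-10.0):
        # After the usual max subtraction, entries far below the row maximum
        # contribute almost nothing to the softmax sum, so they are dropped
        # outright (the "sparsity" part) and only the rest are exponentiated.
        x = scores - scores.max(axis=-1, keepdims=True)
        e = np.where(x > threshold, fast_exp(x), 0.0)
        return e / e.sum(axis=-1, keepdims=True)

    rng = np.random.default_rng(0)
    s = rng.standard_normal((4, 16, 16)).astype(np.float32)
    ref = np.exp(s - s.max(-1, keepdims=True))
    ref /= ref.sum(-1, keepdims=True)
    print(np.abs(approx_softmax(s) - ref).max())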