TurboAttention: Efficient Attention Approximation For High Throughputs LLMs

Hao Kang, Srikant Bharadwaj, James Hensman, Tushar Krishna, Victor Ruhle, Saravan Rajmohan

arXiv.org Artificial Intelligence 

While techniques such as quantization and acceleration algorithms like FlashAttention have improved the efficiency of overall LLM inference, they address different aspects of the problem: quantization focuses on weight-activation operations, while FlashAttention improves execution but requires high-precision formats. Recent key-value (KV) cache quantization reduces memory bandwidth but still needs floating-point dequantization for the attention operation. We present TurboAttention, a comprehensive approach that enables quantized execution of attention and simultaneously addresses both memory and computational efficiency. Our solution introduces two key innovations: FlashQ, a headwise attention quantization technique that enables both compression of the KV cache and quantized execution of activation-activation multiplication, and Sparsity-based Softmax Approximation (SAS), which eliminates the need for dequantization to FP32 during the exponentiation operation in attention. Experimental results demonstrate that TurboAttention achieves 1.2-1.8x [...]

Large language models (LLMs) (Touvron et al., 2023; Gunasekar et al., 2023; Brown et al., 2020) have excelled in tasks like natural language understanding (Joshi et al., 2017; Dodge et al., 2021) and generative text production (Hendrycks et al., 2021; Zhong et al., 2017).

The bottlenecks during LLM inference can be split into three major sections: the linear projection operations (QKV projection and FFN), the memory-intensive Key/Value (KV) [...]
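To make the "memory-intensive KV cache" point concrete, the back-of-the-envelope estimate below uses a Llama-2-7B-like configuration (32 layers, 32 attention heads, head dimension 128); the batch size, context length, and the 4-bit comparison are illustrative assumptions, not figures from the paper.

    def kv_cache_bytes(layers, heads, head_dim, seq_len, batch, bytes_per_elem):
        # Two cached tensors (K and V) per layer, each of shape
        # [batch, heads, seq_len, head_dim].
        return 2 * layers * heads * head_dim * seq_len * batch * bytes_per_elem

    # Illustrative setting: Llama-2-7B-like model, batch of 8, 4k context.
    fp16_cache = kv_cache_bytes(32, 32, 128, 4096, 8, 2)    # 16-bit elements
    int4_cache = kv_cache_bytes(32, 32, 128, 4096, 8, 0.5)  # hypothetical 4-bit cache,
                                                            # ignoring scale/zero-point overhead
    print(f"FP16 KV cache : {fp16_cache / 2**30:.1f} GiB")  # 16.0 GiB
    print(f"4-bit KV cache: {int4_cache / 2**30:.1f} GiB")  # 4.0 GiB

At long contexts this cache grows linearly with sequence length and batch size, which is why compressing it matters independently of weight quantization.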
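The abstract describes FlashQ only at a high level, so the following is a minimal NumPy sketch of what headwise (per-head) quantization of attention activations could look like: each head gets its own scale and zero point, Q and K are stored as integer codes, and the query-key product is carried out on those codes with a single rescale at the end. The helper name quantize_per_head, the 8-bit setting, and the asymmetric scheme are assumptions for illustration, not the paper's actual algorithm.

    import numpy as np

    def quantize_per_head(x, bits=8):
        # Asymmetric per-head quantization: one (scale, zero_point) pair per head.
        # x has shape [heads, seq_len, head_dim].
        qmax = 2 ** bits - 1
        lo = x.min(axis=(1, 2), keepdims=True)
        hi = x.max(axis=(1, 2), keepdims=True)
        scale = (hi - lo) / qmax
        zero = np.round(-lo / scale).astype(np.int32)
        q = np.clip(np.round(x / scale) + zero, 0, qmax).astype(np.int32)
        return q, scale, zero

    heads, seq, dim = 4, 16, 64
    rng = np.random.default_rng(0)
    Q, K = rng.standard_normal((2, heads, seq, dim)).astype(np.float32)

    qQ, sQ, zQ = quantize_per_head(Q)
    qK, sK, zK = quantize_per_head(K)

    # Activation-activation product on integer codes, then one rescale per head:
    #   Q @ K^T  ~=  ((qQ - zQ) @ (qK - zK)^T) * sQ * sK
    scores_int = np.einsum('hqd,hkd->hqk', qQ - zQ, qK - zK)
    scores = scores_int * sQ * sK
    error = np.abs(scores - np.einsum('hqd,hkd->hqk', Q, K)).max()
    print(f"max abs error vs. FP32 scores: {error:.4f}")

Per-head granularity sits between per-tensor scales (coarse, larger error when heads have very different dynamic ranges) and per-channel or per-group scales (finer, more metadata); whether FlashQ layers finer-grained grouping on top of this is not stated in the abstract.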
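Likewise, the abstract states only that SAS removes the FP32 dequantization normally required for exponentiation; the sketch below shows one generic way an attention kernel can avoid a full-precision exp: write exp(x) as an exact power-of-two scale times a small polynomial on the fractional exponent, and drop score entries far below the row maximum, whose contribution to the softmax sum is negligible. The function names, polynomial coefficients, and pruning threshold here are illustrative assumptions, not the paper's method.

    import numpy as np

    LOG2E = 1.4426950408889634  # log2(e)

    def exp2_poly(f):
        # Degree-2 polynomial approximation of 2**f on f in [0, 1).
        # Coefficients are illustrative placeholders, not tuned kernel constants.
        return (0.3371894 * f + 0.657636) * f + 1.00172476

    def fast_exp(x):
        # exp(x) = 2**(x * log2(e)); the integer part of the exponent becomes an
        # exact power-of-two scale (ldexp), the fractional part uses the cheap
        # polynomial above, so no full-precision exp is evaluated.
        t = x * LOG2E
        i = np.floor(t)
        return np.ldexp(exp2_poly(t - i), i.astype(np.int32))

    def approx_softmax(scores, threshold=-10.0):
        # After the usual max subtraction, entries far below the row maximum
        # contribute almost nothing to the softmax sum, so they are dropped
        # outright (the "sparsity" part) and only the rest are exponentiated.
        x = scores - scores.max(axis=-1, keepdims=True)
        e = np.where(x > threshold, fast_exp(x), 0.0)
        return e / e.sum(axis=-1, keepdims=True)

    rng = np.random.default_rng(0)
    s = rng.standard_normal((4, 16, 16)).astype(np.float32)
    ref = np.exp(s - s.max(-1, keepdims=True))
    ref /= ref.sum(-1, keepdims=True)
    print(np.abs(approx_softmax(s) - ref).max())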