Efficient LLM Inference with Kcache
arXiv.org Artificial Intelligence
Large Language Models (LLMs) have had a profound impact on AI applications, particularly in the domains of long-text comprehension and generation. KCache leverages the structural characteristics of Transformer models: during the prefill phase, part of the V Cache is copied to the CPU asynchronously, and the HBM occupied by that part of the V Cache is released. During the decode phase, the K states remain in HBM and are pushed and pulled as in a standard KV Cache. For the V states, however, we compute the top-N attention scores and, based on the indices of the top-N results, pull the corresponding V Cache entries from the CPU back to HBM in real time to complete the subsequent computation. Through this simple approach, we effectively utilize idle CPU memory, increasing the effective capacity of HBM. In this paper, we build an inference engine based on KCache that efficiently reduces the memory footprint of LLM inference, achieving a 40% increase in throughput while maintaining accuracy.
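The decode step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, shapes, and the use of plain NumPy arrays to stand in for HBM and CPU memory are all assumptions; a real system would overlap the CPU-to-HBM transfer with computation.

```python
import numpy as np

def kcache_decode_step(q, k_hbm, v_cpu, top_n):
    """Hypothetical sketch of one KCache decode step.

    q:     (d,)      query vector for the current token
    k_hbm: (seq, d)  K Cache kept in fast memory (HBM)
    v_cpu: (seq, d)  V Cache offloaded to CPU memory
    top_n: number of V entries pulled back for attention
    """
    d = q.shape[0]
    scores = k_hbm @ q / np.sqrt(d)        # attention scores against all keys
    idx = np.argsort(scores)[-top_n:]      # indices of the top-N scores
    v_pulled = v_cpu[idx]                  # pull only the top-N V rows (CPU -> HBM)
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()                           # softmax over the selected scores only
    return w @ v_pulled                    # approximate attention output

# Usage with random data
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal((128, 64))
v = rng.standard_normal((128, 64))
out = kcache_decode_step(q, k, v, top_n=8)
print(out.shape)
```

Because attention weights are dominated by a few large scores, restricting the softmax and weighted sum to the top-N entries keeps the output close to full attention while transferring only N rows of V instead of the whole cache.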
Apr-27-2024