Efficient LLM Inference with Kcache
arXiv.org Artificial Intelligence
Large Language Models (LLMs) have had a profound impact on AI applications, particularly in the domains of long-text comprehension and generation. KCache leverages the structural characteristics of Transformer models: during the prefill phase, part of the V Cache is copied to the CPU asynchronously, and the HBM occupied by that part of the V Cache is released. During the decode phase, the K states remain in HBM and are pushed and pulled as in a standard KV Cache. For the V states, however, we compute the top-N attention scores and, based on the indices of the top-N results, pull the corresponding V Cache entries from the CPU back to HBM in real time to complete the subsequent computation. Through this simple approach, we effectively utilize idle CPU memory, increasing the effective capacity of HBM. In this paper, we build an inference engine based on KCache that efficiently reduces the memory footprint of LLM inference, achieving a 40% increase in throughput while maintaining accuracy.
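The decode step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, shapes, and the use of plain NumPy arrays to stand in for HBM and CPU memory are all assumptions; a real system would overlap the CPU-to-HBM transfer with computation.

```python
import numpy as np

def kcache_decode_step(q, k_hbm, v_cpu, top_n):
    """Hypothetical sketch of one KCache decode step.

    q:     (d,)      query vector for the current token
    k_hbm: (seq, d)  K Cache kept in fast memory (HBM)
    v_cpu: (seq, d)  V Cache offloaded to CPU memory
    top_n: number of V entries pulled back for attention
    """
    d = q.shape[0]
    scores = k_hbm @ q / np.sqrt(d)        # attention scores against all keys
    idx = np.argsort(scores)[-top_n:]      # indices of the top-N scores
    v_pulled = v_cpu[idx]                  # pull only the top-N V rows (CPU -> HBM)
    w = np.exp(scores[idx] - scores[idx].max())
    w /= w.sum()                           # softmax over the selected scores only
    return w @ v_pulled                    # approximate attention output

# Usage with random data
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal((128, 64))
v = rng.standard_normal((128, 64))
out = kcache_decode_step(q, k, v, top_n=8)
print(out.shape)
```

Because attention weights are dominated by a few large scores, restricting the softmax and weighted sum to the top-N entries keeps the output close to full attention while transferring only N rows of V instead of the whole cache.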
Apr-27-2024