Increasing GPU Utilization during Generative Inference for Higher Throughput
Apart from the already-large model parameters, the key/value (KV) cache that holds information about previous tokens in a sequence can grow to be even larger than the model itself. The problem is exacerbated in some current LLM serving frameworks, which reserve KV-cache memory for the maximum sequence length in order to guarantee that a complete sequence can be generated, since the output sequence length is not known in advance. This over-reservation forces a smaller batch size, leading to lower GPU utilization and, above all, lower throughput. We argue that designing a system with a priori knowledge of the output sequence length can mitigate this problem. To make the effect concrete, the sketch below compares the batch size achievable when every request reserves KV-cache memory for the maximum sequence length against the batch size possible if the actual output length were known up front; the model dimensions, memory budget, and sequence lengths are illustrative assumptions, not figures from any particular framework.
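# Illustrative sketch: how max-length KV-cache reservation limits batch size.
# Model dimensions below are assumptions (roughly a 13B-parameter decoder in
# fp16); they are not taken from any specific serving framework.

BYTES_FP16 = 2

def kv_cache_bytes(seq_len, num_layers=40, num_heads=40, head_dim=128):
    # Each token stores one key and one value vector per layer and head.
    return 2 * num_layers * num_heads * head_dim * BYTES_FP16 * seq_len

gpu_mem_for_kv = 16 * 1024**3   # assumed GPU memory left for KV cache after weights
max_seq_len = 2048              # worst case the framework must provision for
avg_output_len = 256            # a typical request is much shorter

per_seq_reserved = kv_cache_bytes(max_seq_len)    # ~1.6 GiB per sequence
per_seq_actual = kv_cache_bytes(avg_output_len)   # ~0.2 GiB per sequence

print("batch size with max-length reservation:", gpu_mem_for_kv // per_seq_reserved)
print("batch size if output length were known:", gpu_mem_for_kv // per_seq_actual)

Under these assumed numbers, reserving for the maximum length admits roughly a tenth of the batch size that knowledge of the true output length would allow, which is the utilization and throughput gap the argument above points to.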