Increasing GPU Utilization during Generative Inference for Higher Throughput

Oct-8-2025, 11:37:11 GMT–Neural Information Processing Systems

Apart from the already-large model parameters, the key/value (KV) cache, which holds information about the previous tokens in a sequence, can grow to be even larger than the model itself. The problem is exacerbated in one current LLM serving framework, which reserves enough KV cache memory for the maximum sequence length so that a complete sequence can always be generated, because it does not know the output sequence length in advance. This forces a smaller batch size, which leads to lower GPU utilization and, above all, lower throughput. We argue that designing a system with a priori knowledge of the output sequence length can mitigate this problem.
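The effect of reserving the maximum sequence length can be illustrated with a little arithmetic. The sketch below is not the authors' system; it only estimates how many sequences fit in a fixed KV cache budget. The model shape, memory budget, and sequence lengths are hypothetical values chosen for illustration, and the estimate assumes standard multi-head attention with 16-bit cache entries and one common length for every sequence.

```python
def kv_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_elem: int = 2) -> int:
    """KV cache bytes needed for one token.

    Each layer stores one key vector and one value vector,
    each hidden_size elements wide (standard multi-head attention).
    """
    return 2 * num_layers * hidden_size * bytes_per_elem


def max_batch_size(free_bytes: int, tokens_reserved_per_seq: int, bytes_per_token: int) -> int:
    """Number of sequences that fit when each reserves a fixed token count."""
    return free_bytes // (tokens_reserved_per_seq * bytes_per_token)


# Hypothetical configuration (assumptions, not taken from the paper).
NUM_LAYERS = 32
HIDDEN_SIZE = 4096
FREE_BYTES = 16 * 1024**3   # memory left for the KV cache after loading weights
MAX_SEQ_LEN = 2048          # what a length-agnostic framework must reserve
KNOWN_SEQ_LEN = 512         # total length if the output length were known up front

per_token = kv_bytes_per_token(NUM_LAYERS, HIDDEN_SIZE)
batch_reserved = max_batch_size(FREE_BYTES, MAX_SEQ_LEN, per_token)
batch_known = max_batch_size(FREE_BYTES, KNOWN_SEQ_LEN, per_token)

print(f"KV cache per token: {per_token / 1024**2:.2f} MiB")
print(f"Batch size when reserving max length: {batch_reserved}")
print(f"Batch size with known length: {batch_known}")
```

Under these assumed numbers, reserving only the known length allows a batch four times larger in the same memory, which is the kind of headroom the abstract argues a length-aware system could exploit.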

sequence, sequence length, throughput



Country:
- North America > United States
  - Massachusetts > Suffolk County
    - Boston (0.04)
  - California > San Diego County
    - Carlsbad (0.04)
- Asia
  - Taiwan (0.04)
  - China (0.04)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
