Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

Jan-19-2025, 22:47:40 GMT–Neural Information Processing Systems

Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks. However, the inference process for LLMs comes with significant computational costs. In this paper, we propose an efficient LLM inference pipeline that harnesses the power of LLMs. Our approach begins by tapping into the potential of LLMs to accurately perceive and predict the response length with minimal overhead. By leveraging this information, we introduce an efficient sequence scheduling technique that groups queries with similar response lengths into micro-batches.
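The grouping step described above can be illustrated with a minimal sketch. It assumes a predicted response length (in tokens) is already available for each query. The function names `schedule_micro_batches` and `padded_decode_cost`, the fixed `batch_size`, and the cost model are hypothetical choices made for illustration, not the authors' implementation. The idea shown is only that sorting queries by predicted length before batching reduces the work wasted when short responses wait for the longest response in their batch.

```python
from typing import List, Sequence


def schedule_micro_batches(predicted_lengths: Sequence[int],
                           batch_size: int) -> List[List[int]]:
    """Group query indices into micro-batches of similar predicted length.

    Illustrative assumption: sort by predicted length, then cut the sorted
    order into fixed-size chunks.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Indices of queries ordered from shortest to longest predicted response.
    order = sorted(range(len(predicted_lengths)),
                   key=lambda i: predicted_lengths[i])
    return [order[start:start + batch_size]
            for start in range(0, len(order), batch_size)]


def padded_decode_cost(batches: List[List[int]],
                       lengths: Sequence[int]) -> int:
    """Rough cost model (an assumption): each batch decodes until its
    longest response finishes, so cost = batch size * max length."""
    return sum(len(batch) * max(lengths[i] for i in batch)
               for batch in batches)


# Hypothetical predicted lengths for six queries, in tokens.
lengths = [12, 480, 15, 500, 20, 450]

# Baseline: batch queries in arrival order.
arrival_batches = [[0, 1, 2], [3, 4, 5]]
# Scheduled: batch queries with similar predicted lengths together.
scheduled_batches = schedule_micro_batches(lengths, batch_size=3)

print(scheduled_batches)
print(padded_decode_cost(arrival_batches, lengths),
      padded_decode_cost(scheduled_batches, lengths))
```

In this toy input, the three short queries end up in one micro-batch and the three long ones in another, so the short responses no longer wait for a long one to finish. In practice the predicted lengths would be imperfect, so any real scheduler would also need a policy for mispredicted responses, which this sketch does not cover.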

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)