Multi-Bin Batching for Increasing LLM Inference Throughput

Ozgur Guldogan, Jackson Kunde, Kangwook Lee, Ramtin Pedarsani

arXiv.org Artificial Intelligence 

Large Language Model (LLM) inference systems are becoming increasingly popular due to their broad capabilities, such as text generation (Li et al., 2024), coding assistance (Chen et al., 2021), and question answering (Jiang et al., 2021). As demand for LLM inference grows, so does the need to optimize its efficiency. Several techniques have been proposed to improve the efficiency of LLM inference systems; batched inference (Sheng et al., 2023; Kwon et al., 2023; Jin et al., 2023) is among the most promising. With batched inference, multiple requests are processed simultaneously, exploiting the underlying hardware's parallelism to improve throughput. Figure 1(a) shows the measured throughput of the Phi-3.5 Mini Instruct model (Abdin et al., 2024) for various batch sizes on an NVIDIA A100 80G GPU, where throughput is computed as the total number of tokens generated across all requests divided by the elapsed time. However, batched inference has a critical drawback: the execution time of each request depends on the number of tokens it generates, which varies across requests. In standard batched inference systems, a computing unit remains locked until all requests in the batch are completed, leading to resource underutilization when requests within a batch have widely differing execution times.
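The underutilization effect can be made concrete with a small back-of-the-envelope simulation. The sketch below is not the paper's code; it assumes hypothetical output lengths and a hypothetical per-step decoding time, and it only illustrates how static batching ties up the computing unit until the longest request in the batch finishes.

```python
# Minimal sketch (not the paper's implementation): why static batching
# underutilizes hardware when output lengths differ across requests.
import random

random.seed(0)
batch_size = 8
step_time = 0.02  # assumed seconds per decoding step for the whole batch

# Hypothetical number of generated tokens per request in one batch.
output_lengths = [random.randint(32, 512) for _ in range(batch_size)]

# In standard (static) batching, the batch occupies the computing unit
# until the longest request finishes, i.e., for max(output_lengths) steps.
batch_steps = max(output_lengths)
batch_time = batch_steps * step_time

total_tokens = sum(output_lengths)
throughput = total_tokens / batch_time                    # tokens per second
utilization = total_tokens / (batch_size * batch_steps)   # fraction of useful slots

print(f"output lengths:   {output_lengths}")
print(f"batch time:       {batch_time:.2f} s")
print(f"throughput:       {throughput:.1f} tokens/s")
print(f"slot utilization: {utilization:.1%}")  # < 100% means idle compute
```

Under these assumed numbers, requests that finish early leave their batch slots idle while the longest request keeps generating, so slot utilization falls well below 100%; grouping requests with similar expected lengths would shrink that gap.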
