On Optimal Caching and Model Multiplexing for Large Model Inference

Banghua Zhu, Ying Sheng, Lianmin Zheng, Clark Barrett, Michael I. Jordan, Jiantao Jiao

arXiv.org Artificial Intelligence 

This progress comes at a cost, however, of increased resource consumption and latency during both training and inference, presenting challenges not only for real-world deployment but also in terms of environmental impact and energy usage (Sharir et al., 2020; Patterson et al., 2021; Bommasani et al., 2022). For instance, LLM-based chatbots typically consist of large transformer-based networks with parameter counts ranging from one billion to several hundred billion (Zhou et al., 2023). Moreover, the auto-regressive nature of LLMs exacerbates latency and resource consumption because the model can only generate one token at a time. Thus, compared with traditional AI-powered services, language model inference costs are much higher and latency is significantly longer, making it nearly impossible to process every query with an LLM in high-throughput query systems such as search engines. In this paper, we explore two simple yet effective strategies to mitigate this problem: (1) employing a caching system to store previous queries, and (2) developing a model multiplexer to choose the most appropriate model from a set of models for processing the queries. The general workflow of our proposed LLM-based inference system is shown in Figure 1: upon receiving a query or prompt, we first check whether it can be retrieved from the cache. If the query is not found in the cache, we use the model multiplexer to decide which model should process it first, based on the estimated cost of each model. The choice of cost function and models can vary with the goal; one measure of cost, for example, is floating point operations (FLOPs).
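
To make this workflow concrete, the following is a minimal sketch of the cache-then-multiplex loop under some simplifying assumptions: an exact-match dictionary stands in for the caching system, and a generic cost estimator stands in for the model multiplexer. The names InferencePipeline, small_model, large_model, and cost_estimator are illustrative and are not prescribed by the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class InferencePipeline:
    # Hypothetical interface; the paper does not prescribe these names.
    small_model: Callable[[str], str]                    # cheaper, lower-quality model
    large_model: Callable[[str], str]                    # more expensive, higher-quality model
    cost_estimator: Callable[[str], Dict[str, float]]    # predicts per-model cost (e.g., FLOPs) for a query
    cache: Dict[str, str] = field(default_factory=dict)  # exact-match cache of previous queries

    def process(self, query: str) -> str:
        # Step 1: serve the query from the cache if it has been answered before.
        cached = self.cache.get(query)
        if cached is not None:
            return cached

        # Step 2: cache miss -- the multiplexer routes to the model with the lower estimated cost.
        costs = self.cost_estimator(query)
        model = self.small_model if costs["small"] <= costs["large"] else self.large_model

        # Step 3: run the chosen model and store the response for future queries.
        response = model(query)
        self.cache[query] = response
        return response


# Toy usage: the costs below are stand-ins for an actual FLOP or latency estimate.
pipeline = InferencePipeline(
    small_model=lambda q: f"small-model answer to {q!r}",
    large_model=lambda q: f"large-model answer to {q!r}",
    cost_estimator=lambda q: {"small": 1.0, "large": 10.0},
)
print(pipeline.process("What is model multiplexing?"))  # miss: routed by the multiplexer
print(pipeline.process("What is model multiplexing?"))  # hit: returned from the cache
```

In a real deployment the dictionary would be replaced by a bounded cache with an eviction policy and the fixed cost estimator by a learned multiplexer, but the control flow (check the cache, otherwise route by estimated cost, then store the response) mirrors the workflow in Figure 1.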
