Stateful Large Language Model Serving with Pensieve

Lingfan Yu, Jinyang Li

arXiv.org Artificial Intelligence 

Existing LLM serving systems are stateless across requests. Consequently, when LLMs are used in the common setting of multi-turn conversations, a growing log of the conversation history must be processed alongside any request by the serving system at each turn, resulting in repeated processing.

In the conversational setup, the user and the chatbot are engaged in a dialogue that may last many rounds. For the chatbot not to "lose memory" of what has been said so far when responding, the cumulative history of the dialogue must be part of the context for the LLM's autoregressive generation.
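To make the cost of statelessness concrete, the following is a minimal sketch that counts how many context tokens a server processes over a conversation. It is an illustration under stated assumptions, not the system's implementation: the function `tokens_processed`, the `stateful` flag, and the per-turn token counts are all hypothetical, and the model ignores everything except the number of tokens fed to the model at each turn.

```python
def tokens_processed(turn_lengths, stateful):
    """Count context tokens processed over a whole conversation.

    turn_lengths: new tokens added at each turn (hypothetical values).
    stateful: if True, assume state for earlier turns is retained, so only
              the new tokens of each turn are processed (an assumption made
              for illustration).
    """
    total = 0
    history = 0  # tokens accumulated from all earlier turns
    for new_tokens in turn_lengths:
        if stateful:
            total += new_tokens            # history is not reprocessed
        else:
            total += history + new_tokens  # whole history is reprocessed
        history += new_tokens
    return total


turns = [100] * 10  # ten turns of 100 new tokens each (made-up numbers)
print(tokens_processed(turns, stateful=False))  # 5500
print(tokens_processed(turns, stateful=True))   # 1000
```

With equal-length turns, the stateless total grows quadratically in the number of turns, while the stateful total grows linearly. That gap is the repeated work described above.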