 serverlessllm


Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud

arXiv.org Artificial Intelligence

This review report discusses the cold start latency in serverless inference and existing solutions. It particularly reviews the ServerlessLLM method, a system designed to address the cold-start problem in serverless inference for large language models (LLMs). Traditional serverless approaches struggle with high latency due to the size of LLM checkpoints and the ...

These models, due to their size (often reaching hundreds of gigabytes) and computational requirements, encounter delays due to what is known as the cold-start problem [22]. This latency arises when previously idle serverless functions start up, leading to delays from the loading of extensive LLM checkpoints and GPU resource activation. Such cold starts can significantly hinder performance in applications requiring real-time interaction, making solutions to this problem imperative for scalable, serverless LLM deployment.


ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models

arXiv.org Artificial Intelligence

This paper presents ServerlessLLM, a locality-enhanced serverless inference system for Large Language Models (LLMs). ServerlessLLM exploits the substantial capacity and bandwidth of storage and memory devices available on GPU servers, thereby reducing costly remote checkpoint downloads and achieving efficient checkpoint loading. ServerlessLLM achieves this through three main contributions: (i) fast LLM checkpoint loading via a novel loading-optimized checkpoint format design, coupled with an efficient multi-tier checkpoint loading system; (ii) locality-driven LLM inference with live migration, which allows ServerlessLLM to effectively achieve locality-driven server allocation while preserving the low latency of ongoing LLM inference; and (iii) locality-aware server allocation, enabling ServerlessLLM to evaluate the status of each server in a cluster and effectively schedule model startup time to capitalize on local checkpoint placement. Our comprehensive experiments, which include microbenchmarks and real-world traces, show that ServerlessLLM surpasses ...

Furthermore, LLM inference latency is difficult to predict because response time depends on the output length, which can vary significantly [24, 39, 77] due to iterative output token generation. To achieve low latency, processing an LLM request often requires several GPUs for durations ranging from seconds to minutes. In practice, LLM service providers need to host a large number of LLMs catering to different developers, leading to significant GPU consumption [15] and impeding the sustainability of LLM services [19]. As a result, LLM inference services have to impose strict caps on the number of requests their users can send (e.g., 40 messages per 3 hours for ChatGPT [51]), showing the providers' current inability to satisfy LLM inference demand. Researchers [19] project that LLM inference costs may increase by more than 50× when it reaches the popularity of Google Search.

To reduce GPU consumption, LLM service providers are exploring serverless inference, as seen in systems like Amazon SageMaker [60], Azure [46], KServe [11], and Hugging Face [31].
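
The locality-aware server allocation mentioned in the ServerlessLLM abstract can be pictured with a small sketch. This is not the paper's scheduler: the tier names, the bandwidth figures, the `Server` fields, and the helper functions `estimate_startup_s` and `pick_server` are all hypothetical, chosen only to show how estimating startup time per server lets a scheduler prefer servers that already hold a checkpoint in a fast local tier.

```python
from dataclasses import dataclass

# Assumed, illustrative bandwidths in GB/s for each checkpoint tier.
# These are placeholders, not measurements from the paper.
TIER_BANDWIDTH_GBPS = {"dram": 50.0, "ssd": 5.0, "remote": 1.0}


@dataclass
class Server:
    name: str
    checkpoint_tier: str   # fastest tier holding the checkpoint, or "remote"
    queue_delay_s: float   # assumed wait until the server's GPUs are free


def estimate_startup_s(server: Server, checkpoint_gb: float) -> float:
    """Estimated time until the model can start serving on this server."""
    load_s = checkpoint_gb / TIER_BANDWIDTH_GBPS[server.checkpoint_tier]
    return server.queue_delay_s + load_s


def pick_server(servers: list[Server], checkpoint_gb: float) -> Server:
    """Choose the server with the lowest estimated startup time."""
    return min(servers, key=lambda s: estimate_startup_s(s, checkpoint_gb))


servers = [
    Server("a", "remote", 0.0),   # idle, but must download the checkpoint
    Server("b", "ssd", 2.0),      # checkpoint on local SSD, short queue
    Server("c", "dram", 10.0),    # checkpoint in host memory, longer queue
]
best = pick_server(servers, checkpoint_gb=100.0)
```

Under these assumed numbers, a 100 GB checkpoint goes to server `c` despite its longer queue, while a 10 GB checkpoint would go to server `b`: the larger the checkpoint, the more local placement outweighs queueing delay.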