AITopics | sarathi-serve


sarathi-serve

CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts

Zhang, Zeyu, Shen, Haiying

arXiv.org Artificial Intelligence, Sep-23-2024

... applications have become increasingly popular. In this paper, through trace-based experiments, we found that the existing method for long sequences results in a high Time-To-First-Token (TTFT) due to sequential chunk processing, long Time-Between-Tokens (TBT) from batching long-sequence prefills and decodes, and low throughput due to constrained key-value cache (KVC) for long sequences. To address these issues, we propose two Sequence-Parallelism (SP) architectures for both tensor parallelism (TP) and non-TP. However, SP introduces two challenges: 1) network communication and computation become performance bottlenecks; 2) the latter two issues above are mitigated but not resolved, and SP's resultant KV value distribution across GPUs still requires communication for decode, increasing TBT.

For example, applications such as book summarization [12-14], document classification [15, 16], and coding assistance [17] require a longer or unlimited sequence length to fully understand the extended context. Some long-sequence applications, such as coding assistance, require short response time (e.g., in seconds). However, through experimental measurements, we made Observation (O): O1. The existing serving system that handles long sequences, Sarathi-Serve [18], generates long Time-To-First-Token (TTFT) (in minutes) due to sequential chunk processing, high Time-Between-Token (TBT) (e.g., 6 seconds) from batching long-sequence prefills and decodes, and low throughput due to small batch size caused by constrained KV cache size and ...
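The link between sequential chunk processing and TTFT can be illustrated with a little arithmetic. The sketch below is not taken from the paper: the function names, the chunk size, and the assumption of a constant per-iteration time are all hypothetical. In practice later chunks attend to more context and take longer, so the result is at best a rough lower bound.

```python
import math


def prefill_iterations(prompt_len: int, chunk_size: int) -> int:
    """Number of sequential iterations needed to prefill a prompt
    when it is processed one chunk per iteration."""
    return math.ceil(prompt_len / chunk_size)


def ttft_lower_bound(prompt_len: int, chunk_size: int,
                     seconds_per_iteration: float) -> float:
    """Rough TTFT estimate, ASSUMING every iteration takes the same time.
    The first output token only appears after the last chunk is processed."""
    return prefill_iterations(prompt_len, chunk_size) * seconds_per_iteration


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show how the estimate scales.
    print(prefill_iterations(100_000, 512))
    print(ttft_lower_bound(100_000, 512, 0.5))
```

Under these assumptions TTFT grows linearly with prompt length for a fixed chunk size, which is the effect the observation above attributes to sequential chunk processing.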

computation, gpus, partition, (16 more...)

arXiv.org Artificial Intelligence

2409.15104

Country:

North America > United States > Virginia (0.04)
North America > United States > California > Santa Clara County > Santa Clara (0.04)
North America > United States > California > San Diego County > Carlsbad (0.04)
(6 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

Agrawal, Amey, Kedia, Nitin, Panwar, Ashish, Mohan, Jayashree, Kwatra, Nipun, Gulavani, Bhargav S., Tumanov, Alexey, Ramjee, Ramachandran

arXiv.org Artificial Intelligence, Jun-17-2024

Each LLM serving request goes through two phases. The first is prefill, which processes the entire input prompt and produces the first output token; the second is decode, which generates the remaining output tokens one at a time. Prefill iterations have high latency but saturate GPU compute because the input prompt is processed in parallel. In contrast, decode iterations have low latency but also low compute utilization because each decode iteration processes only a single token per request. This makes batching highly effective for decodes and consequently for overall throughput. However, batching multiple requests interleaves prefill and decode iterations, which makes it challenging to achieve both high throughput and low latency. We introduce an efficient LLM inference scheduler, Sarathi-Serve, to address this throughput-latency tradeoff. Sarathi-Serve introduces chunked-prefills, which split a prefill request into near-equal-sized chunks, and creates stall-free schedules that add new requests to a batch without pausing ongoing decodes. Stall-free scheduling unlocks the opportunity to improve throughput with large batch sizes while minimizing the effect of batching on latency. Furthermore, uniform batches in Sarathi-Serve ameliorate the imbalance between iterations, resulting in minimal pipeline bubbles. Our techniques yield significant improvements in inference performance across models and hardware under tail latency constraints. For Mistral-7B on a single A100 GPU, we achieve 2.6x higher serving capacity, and for the Yi-34B model on two A100 GPUs, up to 3.7x higher serving capacity, compared to vLLM. When used with pipeline parallelism on Falcon-180B, Sarathi-Serve provides up to 5.6x gain in end-to-end serving capacity. The source code for Sarathi-Serve is available at https://github.com/microsoft/sarathi-serve.
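The idea of chunked prefills combined with stall-free batching can be sketched in a few lines. This is a minimal illustration and not the Sarathi-Serve implementation: the `Request` class, the `build_batch` function, and the per-iteration `token_budget` parameter are hypothetical names introduced here, and the sketch assumes the budget is at least as large as the number of ongoing decodes.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative only."""
    name: str
    prompt_len: int      # tokens in the input prompt
    prefilled: int = 0   # prompt tokens processed so far

    @property
    def in_decode(self) -> bool:
        return self.prefilled >= self.prompt_len


def build_batch(running, waiting, token_budget):
    """Build one iteration's batch as a list of (request name, token count).

    ASSUMPTION: token_budget >= number of requests currently decoding.
    """
    # 1. Every ongoing decode is scheduled first, one token each,
    #    so admitting new work never pauses a decode.
    batch = [(r.name, 1) for r in running if r.in_decode]
    budget = token_budget - len(batch)

    # 2. Leftover budget is spent on prefill chunks: partially prefilled
    #    requests first, then newly admitted ones from the waiting queue.
    prefill_queue = [r for r in running if not r.in_decode] + list(waiting)
    for req in prefill_queue:
        if budget <= 0:
            break
        chunk = min(budget, req.prompt_len - req.prefilled)
        req.prefilled += chunk
        batch.append((req.name, chunk))
        budget -= chunk
        if req in waiting:  # admit the new request into the running set
            waiting.remove(req)
            running.append(req)
    return batch


if __name__ == "__main__":
    running = [Request("a", 4, 4), Request("b", 4, 4)]  # already decoding
    waiting = [Request("c", 10)]                        # new 10-token prompt
    for _ in range(4):
        print(build_batch(running, waiting, token_budget=6))
```

In this toy run the new request's prompt is spread over several iterations while the two decoding requests receive a token in every one of them. Because each batch is capped by the same budget, iteration sizes stay roughly uniform.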

batch, inference, sarathi-serve, (16 more...)

arXiv.org Artificial Intelligence

2403.0231

Country:

North America > United States > New York > New York County > New York City (0.04)
Asia > India (0.04)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(3 more...)

Genre: Research Report (0.82)

Industry: Information Technology (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
