CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts
–arXiv.org Artificial Intelligence
... applications have become increasingly popular. In this paper, through trace-based experiments, we found that the existing method for long sequences results in a high Time-To-First-Token (TTFT) due to sequential chunk processing, long Time-Between-Tokens (TBT) from batching long-sequence prefills and decodes, and low throughput due to constrained key-value cache (KVC) for long sequences. To address these issues, we propose two Sequence-Parallelism (SP) architectures for both tensor parallelism (TP) and non-TP. However, SP introduces two challenges: 1) network communication and computation become performance bottlenecks; 2) the latter two issues above are mitigated but not resolved, and SP's resultant KV value distribution across GPUs still requires communication for decode, increasing TBT.
For example, applications such as book summarization [12-14], document classification [15, 16], and coding assistance [17] require a longer or unlimited sequence length to fully understand the extended context. Some long-sequence applications, such as coding assistance, require short response time (e.g., in seconds). However, through experimental measurements, we made Observation (O):
O1. The existing serving system that handles long sequences, Sarathi-Serve [18], generates long Time-To-First-Token (TTFT) (in minutes) due to sequential chunk processing, high Time-Between-Tokens (TBT) (e.g., 6 seconds) from batching long-sequence prefills and decodes, and low throughput due to small batch size caused by constrained KV cache size and ...
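The two latency metrics named above can be made concrete with a small sketch. It assumes the usual definitions (TTFT is the delay from request arrival to the first generated token; TBT is the gap between consecutive generated tokens) and is not taken from the paper's code. The function name `latency_metrics` and the timestamps are hypothetical.

```python
from typing import List, Tuple


def latency_metrics(arrival: float, token_times: List[float]) -> Tuple[float, List[float]]:
    """Return (TTFT, TBT gaps) for one request.

    arrival      -- time the request entered the system, in seconds
    token_times  -- emission time of each generated token, in order
    """
    if not token_times:
        raise ValueError("request produced no tokens")
    # TTFT: how long the user waits before the first token appears.
    ttft = token_times[0] - arrival
    # TBT: one gap per pair of consecutive tokens.
    tbt = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, tbt


# Illustrative timestamps only (not measurements from the paper).
ttft, tbt = latency_metrics(arrival=0.0, token_times=[2.0, 2.5, 3.5])
```

Under these definitions, a long prefill processed chunk by chunk delays `token_times[0]` and so raises TTFT, while a long prefill batched with ongoing decodes widens the gaps in `tbt`.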
Sep-23-2024
- Genre:
- Research Report (0.82)