SpaceServe: Spatial Multiplexing of Complementary Encoders and Decoders for Multimodal LLMs

Jun-18-2026, 10:34:49 GMT–Neural Information Processing Systems

Recent multimodal large language models (MLLMs) marry modality-specific vision or audio encoders with a shared text decoder. While the encoder is compute-intensive but memory-light, the decoder is the opposite, yet state-of-the-art serving stacks still time-multiplex these complementary kernels, idling SMs or HBM in turn. We introduce SpaceServe, a serving system that space-multiplexes MLLMs: it decouples all modality encoders from the decoder and co-locates them on the same GPU using the fine-grained SM partitioning available in modern runtimes. A cost-model-guided Space-Inference Scheduler (SIS) dynamically assigns SM slices, while a Time-Windowed Shortest-Remaining-First (TWSRFT) policy batches encoder requests to minimise completion latency and smooth decoder arrivals. Evaluation shows that SpaceServe reduces time-per-output-token by 4.81× on average and by up to 28.9× on Nvidia A100 GPUs.
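The abstract names the TWSRFT policy but does not spell out its mechanics. As a rough illustration only, the idea of batching encoder requests by time window and ordering each window by shortest remaining work could be sketched as below. The class `EncoderRequest`, the function `twsrft_batches`, the parameters `window_ms` and `max_batch`, and the per-request `remaining_ms` estimate are all hypothetical names chosen for this sketch, not taken from the paper, and the real policy may differ.

```python
from dataclasses import dataclass


@dataclass
class EncoderRequest:
    """Hypothetical encoder request; fields are assumptions for this sketch."""
    request_id: str
    arrival_ms: float    # time the request reached the encoder queue
    remaining_ms: float  # estimated encoder work left (assumed to come from a cost model)


def twsrft_batches(requests, window_ms, max_batch):
    """Group requests into fixed arrival windows, then within each window
    emit batches in shortest-remaining-first order.

    Returns a list of batches, each a list of request ids.
    """
    if window_ms <= 0 or max_batch <= 0:
        raise ValueError("window_ms and max_batch must be positive")

    # Bucket requests by the index of the time window they arrived in.
    windows = {}
    for req in requests:
        windows.setdefault(int(req.arrival_ms // window_ms), []).append(req)

    batches = []
    for index in sorted(windows):
        # Shortest remaining work first; earlier arrival breaks ties.
        ordered = sorted(windows[index], key=lambda r: (r.remaining_ms, r.arrival_ms))
        for start in range(0, len(ordered), max_batch):
            batches.append([r.request_id for r in ordered[start:start + max_batch]])
    return batches


demo = [
    EncoderRequest("a", arrival_ms=0.0, remaining_ms=30.0),
    EncoderRequest("b", arrival_ms=2.0, remaining_ms=5.0),
    EncoderRequest("c", arrival_ms=4.0, remaining_ms=12.0),
    EncoderRequest("d", arrival_ms=11.0, remaining_ms=8.0),
]
print(twsrft_batches(demo, window_ms=10.0, max_batch=2))
# → [['b', 'c'], ['a'], ['d']]
```

In this sketch the window bounds how long a short request can be delayed by later arrivals, while the shortest-first ordering inside a window lets small encoder jobs reach the decoder sooner. How SpaceServe actually sizes windows or estimates remaining work is not stated in the abstract.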

large language model, machine learning, natural language

Country:
- North America > United States (0.93)

Genre:
- Research Report > Experimental Study (1.00)

Industry:
- Information Technology (0.48)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.93)
