Splitwiser: Efficient LM inference with constrained resources
Asad Aali, Adney Cardoza, Melissa Capo
arXiv.org Artificial Intelligence
Efficient inference of LLMs remains a crucial challenge, with two main phases: a compute-intensive prompt computation and a memory-intensive token generation. Despite existing batching and scheduling techniques, token-generation phases fail to fully utilize compute resources, especially when compared to prompt-computation phases. To address these challenges, we propose Splitwiser, a methodology that splits the two phases of an LLM inference request onto the same GPU, thereby reducing overhead and improving memory access and cache utilization. By eliminating the need to transfer data across devices, Splitwiser aims to minimize network-related overheads. In this report, we describe the basic structure of our proposed pipeline and share preliminary results and analysis. We implement our proposed multiprocessing design on two widely used and independent LLM inference frameworks: Huggingface and vLLM.

Generative Large Language Models (LLMs) have become essential in computing, offering vast capabilities in natural language processing. However, their widespread adoption has led to challenges, particularly in inference efficiency.
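The abstract's split between a compute-intensive prefill (prompt computation) phase and a memory-intensive decode (token generation) phase can be illustrated with a minimal multiprocessing sketch. This is a hypothetical toy, not the authors' implementation: the worker names, the dictionary standing in for a KV cache, and the fake token strings are all illustrative assumptions; a real system would run actual model kernels on a shared GPU.

```python
# Hypothetical sketch of Splitwiser-style phase splitting (NOT the authors' code):
# two worker processes on the same device, a compute-bound "prefill" worker and a
# memory-bound "decode" worker, handing off cached state through a queue.
import multiprocessing as mp

def prefill_worker(prompts, kv_queue):
    # Compute-intensive phase: process each full prompt once, emit a KV-cache stub.
    for rid, prompt in enumerate(prompts):
        kv_cache = {"request": rid, "tokens": prompt.split()}  # stand-in for real KV state
        kv_queue.put(kv_cache)
    kv_queue.put(None)  # sentinel: no more requests

def decode_worker(kv_queue, out_queue, max_new_tokens=3):
    # Memory-intensive phase: generate tokens one at a time from the cached state.
    while (kv := kv_queue.get()) is not None:
        generated = [f"tok{i}" for i in range(max_new_tokens)]  # stand-in for sampling
        out_queue.put((kv["request"], generated))
    out_queue.put(None)

def run(prompts):
    kv_q, out_q = mp.Queue(), mp.Queue()
    p1 = mp.Process(target=prefill_worker, args=(prompts, kv_q))
    p2 = mp.Process(target=decode_worker, args=(kv_q, out_q))
    p1.start()
    p2.start()
    results = {}
    while (item := out_q.get()) is not None:
        rid, toks = item
        results[rid] = toks
    p1.join()
    p2.join()
    return results

if __name__ == "__main__":
    print(run(["hello world", "efficient inference"]))
```

Because both workers are separate OS processes, the decode phase of one request can overlap with the prefill phase of the next, which is the overlap the paper's single-GPU design relies on.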
May-8-2025