FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference

Du, Hongchao, Wu, Shangyu, Kharlamova, Arina, Guan, Nan, Xue, Chun Jason

arXiv.org Artificial Intelligence 

Although these approaches can improve models' Large Language Models (LLMs) face challenges for on-device memory efficiency, they inevitably impact the generality inference due to high memory demands. Traditional methods performance and still suffer in extreme resource-constrained to reduce memory usage often compromise performance scenarios [4, 9, 12]. Furthermore, these methods lack the flexibility and lack adaptability. We propose FlexInfer, an optimized to vary memory budgets or deployment constraints, offloading framework for on-device inference, addressing requiring adjusting the hyper-parameters, such as quantization these issues with techniques like asynchronous prefetching, or sparsity levels, offering limited choices, and imposing balanced memory locking, and flexible tensor preservation.

Duplicate Docs Excel Report

Title
None found

Similar Docs  Excel Report  more

TitleSimilaritySource
None found