FlexInfer: Breaking Memory Constraint via Flexible and Efficient Offloading for On-Device LLM Inference
Du, Hongchao, Wu, Shangyu, Kharlamova, Arina, Guan, Nan, Xue, Chun Jason
arXiv.org Artificial Intelligence
Large Language Models (LLMs) face challenges for on-device inference due to high memory demands. Traditional methods to reduce memory usage often compromise performance and lack adaptability. We propose FlexInfer, an optimized offloading framework for on-device inference, addressing these issues with techniques like asynchronous prefetching, balanced memory locking, and flexible tensor preservation.

Although these approaches can improve models' memory efficiency, they inevitably impact the generality performance and still suffer in extreme resource-constrained scenarios [4, 9, 12]. Furthermore, these methods lack the flexibility to vary memory budgets or deployment constraints, requiring adjusting the hyper-parameters, such as quantization or sparsity levels, offering limited choices, and imposing ...
Mar-4-2025
- Genre:
- Research Report (1.00)
- Industry:
- Information Technology (1.00)
- Technology: