MNN-LLM: A Generic Inference Engine for Fast Large Language Model Deployment on Mobile Devices

Wang, Zhaode; Yang, Jingbang; Qian, Xinyu; Xing, Shiwen; Jiang, Xiaotang; Lv, Chengfei; Zhang, Shengyu

Jun-13-2025–arXiv.org Artificial Intelligence

Large language models (LLMs) have demonstrated exceptional performance across a variety of tasks. However, their substantial scale leads to significant computational resource consumption during inference, resulting in high costs. Consequently, edge device inference presents a promising solution. The primary challenges of edge inference are memory usage and inference speed. This paper introduces MNN-LLM, a framework specifically designed to accelerate the deployment of large language models on mobile devices. MNN-LLM addresses the runtime characteristics of LLMs through model quantization and DRAM-Flash hybrid storage, effectively reducing memory usage. It rearranges weights and inputs based on mobile CPU instruction sets and GPU characteristics while employing strategies such as multicore load balancing, mixed-precision floating-point operations, and geometric computations to enhance performance. Notably, MNN-LLM achieves up to an 8.6x speedup compared to current mainstream LLM-specific frameworks.

large language model, machine learning, quantization


Country:
- Asia > China (0.30)

Genre:
- Research Report (0.86)

Industry:
- Information Technology (0.93)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
