DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment

Kwon, Sangwoo, Seo, Seong Hoon, Lee, Jae W., Park, Yeonhong

Dec-9-2025–arXiv.org Artificial Intelligence

How can we effectively handle queries for on-device large language models (LLMs) with varying runtime constraints, such as latency and accuracy? Multi-scale quantization addresses this challenge by enabling memory-efficient runtime model adaptation of LLMs through the overlaying of multiple model variants quantized to different bitwidths. Meanwhile, an important question still remains open-ended: how can models be properly configured to match a target precision or latency? While mixed-precision offers a promising solution, we take this further by leveraging the key observation that the sensitivity of each layer dynamically changes across decoding steps. Building on this insight, we introduce DP-LLM, a novel mechanism that dynamically assigns precision to each layer based on input values. Experimental results across multiple models and benchmarks demonstrate that DP-LLM achieves a superior performance-latency trade-off, outperforming prior approaches.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

Dec-9-2025

arXiv.org PDF

Add feedback

Genre:
- Research Report > Promising Solution (0.48)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.69)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found