DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment

Neural Information Processing Systems 

How can we effectively handle queries for on-device large language models (LLMs) with varying runtime constraints, such as latency and accuracy?
