End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost

Tan, Qitao, Song, Xiaoying, Lu, Jin, Li, Guoming, Liu, Jun, Hong, Lingzi, Ding, Caiwen, Li, Jundong, Zhai, Xiaoming, Huang, Shaoyi, Niu, Wei, Yuan, Geng

arXiv.org Artificial Intelligence 

Quantization is an effective technique to reduce the deployment cost of large language models (LLMs), and post-training quantization (PTQ) has been widely studied due to its efficiency. However, existing PTQ methods are limited by their inability to fine-tune model parameters and often suffer significant accuracy loss in low-bit scenarios. Quantization-aware training (QAT) provides a more principled solution, but its reliance on backpropagation incurs prohibitive memory costs, limiting its practicality for LLM deployment. To address these challenges, we propose ZeroQAT, a zeroth-order optimization-based QAT framework that supports both weight and activation quantization. ZeroQAT leverages forward-only gradient estimation to eliminate backpropagation, substantially reducing computational and memory overhead while retaining the benefits of end-to-end optimization. We further introduce a lightweight variant of ZeroQAT for quantized fine-tuning, which freezes and pre-quantizes most parameters to further cut memory usage. Experiments show that ZeroQAT consistently outperforms representative PTQ and QAT baselines while requiring significantly less memory. For example, ZeroQAT enables fine-tuning of a 13B model at extremely low bit-widths (e.g., 2-4 bits) on a single 8GB GPU, and even allows fine-tuning a 6.7B model on a OnePlus 12 smartphone, demonstrating its practicality for end-to-end QAT on resource-limited edge devices.

Large language models (LLMs) have emerged as essential tools for advancing natural language understanding and generation, driving progress in both research and industrial applications (Yang et al., 2019; Liu et al., 2019; Talmor et al., 2018; Chowdhery et al., 2023; Zheng et al., 2020). Despite their transformative potential, training and deploying these models incur extremely high computational and memory costs.
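The forward-only gradient estimation mentioned above can be illustrated with a minimal sketch. The snippet below uses a generic SPSA-style two-point estimator: the loss is evaluated at two randomly perturbed parameter vectors, and their difference projects onto the perturbation direction to approximate the gradient without any backward pass. This is a hedged illustration of the general technique, not ZeroQAT's actual estimator; the function names, step sizes, and toy loss are assumptions for demonstration.

```python
import numpy as np

def spsa_gradient(loss_fn, theta, eps=1e-3, rng=None):
    """Estimate the gradient of loss_fn at theta using only two forward passes.

    Note: a generic SPSA-style estimator for illustration; ZeroQAT's exact
    perturbation scheme may differ.
    """
    rng = rng or np.random.default_rng(0)
    z = rng.standard_normal(theta.shape)           # random perturbation direction
    # Central finite difference along z, using two loss evaluations only.
    directional = (loss_fn(theta + eps * z) - loss_fn(theta - eps * z)) / (2 * eps)
    return directional * z                          # projected gradient estimate

# Toy quadratic loss with minimum at theta = [1, -2] (hypothetical example).
target = np.array([1.0, -2.0])
loss = lambda th: float(np.sum((th - target) ** 2))

theta = np.zeros(2)
for step in range(2000):
    # Plain SGD using the zeroth-order gradient estimate; no backprop anywhere.
    theta -= 0.05 * spsa_gradient(loss, theta, rng=np.random.default_rng(step))
```

Because only forward evaluations are needed, the activations and computation graph that backpropagation would store never have to be kept in memory, which is the source of the memory savings the abstract describes.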
Such requirements not only constrain accessibility and scalability but also limit practicality in resource-constrained environments, including mobile and edge devices, embedded systems, and even enterprise servers with strict hardware or budget limitations (Zeng et al., 2024; Chen et al., 2024a; Tan et al., 2025). To address these challenges, model compression has been widely studied, with quantization being one of the most effective and indispensable techniques for deployment.
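To make the quantization setting concrete, the sketch below shows generic symmetric per-tensor uniform quantization followed by dequantization ("fake quantization"), the kind of low-bit weight mapping that PTQ and QAT methods optimize around. The rounding scheme and function name are illustrative assumptions, not ZeroQAT's specific quantizer.

```python
import numpy as np

def quantize_dequantize(w, bits=4):
    """Symmetric per-tensor uniform quantization, then dequantization.

    A generic illustration of low-bit weight quantization; not the
    paper's actual quantizer.
    """
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 for signed 4-bit
    scale = np.max(np.abs(w)) / qmax        # map the largest magnitude to qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)  # integer grid values
    return q * scale                        # back to float ("fake-quant") weights

w = np.array([0.9, -0.35, 0.02, 0.7])
w_q = quantize_dequantize(w, bits=4)
```

At 2-4 bits the grid is very coarse, which is why purely post-training rounding loses accuracy and why end-to-end optimization of the quantized model, as QAT performs, helps recover it.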
