An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits
Cody Steinmetz, Gavin Childress, Aaron Herbst, Gavin Jones, Jasdeep Singh, Eli Vang, Keagan Weinstock
Large language models (LLMs) have transformed natural-language processing, yet their scale makes real-world deployment costly. Post-training quantization reduces memory and computation but often degrades accuracy, while quantization-aware training can recover performance at the cost of extra training. Pushing quantization to the ternary regime (1.58 bits per weight) yields even larger savings but is notoriously unstable. Building on recent work showing that a bias-free, RMS-normalized Transformer with straight-through estimation can reach 1.58-bit precision, we demonstrate that simply inserting RMS normalization before every linear projection and applying a gradual, layer-wise quantization schedule stably fine-tunes full-precision checkpoints into ternary LLMs. Our approach matches or surpasses more elaborate knowledge-distillation pipelines on standard language-modeling benchmarks without adding model complexity. These results indicate that careful normalization alone can close much of the accuracy gap between ternary and full-precision LLMs, making ultra-low-bit inference practical.
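The abstract names three ingredients: an RMSNorm inserted before every linear projection, ternary weights trained with a straight-through estimator, and a gradual, layer-wise quantization schedule. The following is a minimal PyTorch sketch of how these pieces could fit together; the `TernaryLinear` and `apply_layerwise_schedule` names are hypothetical, and the absmean ternary quantizer is an assumption borrowed from BitNet-b1.58-style recipes rather than a detail stated in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Standard RMS normalization: x / rms(x) * g."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


class TernaryLinear(nn.Module):
    """Hypothetical layer: a linear projection preceded by an extra RMSNorm,
    with optional ternary weight quantization trained via a straight-through
    estimator. The absmean quantizer below is assumed, not taken from the
    abstract."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.normal_(self.weight, std=0.02)
        self.norm = RMSNorm(in_features)  # the "extra RMSNorm" before the projection
        self.quantize = False             # flipped on by the layer-wise schedule

    def _ternarize(self, w: torch.Tensor) -> torch.Tensor:
        scale = w.abs().mean().clamp(min=1e-8)          # absmean scale
        w_q = (w / scale).round().clamp(-1, 1) * scale  # weights in {-1, 0, +1} * scale
        return w + (w_q - w).detach()                   # straight-through estimator

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self._ternarize(self.weight) if self.quantize else self.weight
        return F.linear(self.norm(x), w)


def apply_layerwise_schedule(model: nn.Module, step: int, steps_per_layer: int) -> None:
    """One plausible reading of the gradual, layer-wise schedule: enable
    ternary quantization on one additional TernaryLinear every
    `steps_per_layer` optimizer steps."""
    layers = [m for m in model.modules() if isinstance(m, TernaryLinear)]
    n_active = min(step // steps_per_layer, len(layers))
    for i, layer in enumerate(layers):
        layer.quantize = i < n_active
```

Enabling quantization one layer at a time lets the still-full-precision layers absorb the perturbation introduced by each newly ternarized layer, which is one plausible reading of why a gradual schedule stabilizes fine-tuning relative to quantizing every projection at once.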
arXiv.org Artificial Intelligence
May-15-2025
- Country:
- North America > United States > Wisconsin > Milwaukee County > Milwaukee (0.05)
- Genre:
- Research Report > New Finding (0.47)
- Technology: