QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs

Huang, Wei, Ge, Yi, Yang, Shuai, Xiao, Yicheng, Mao, Huizi, Lin, Yujun, Ye, Hanrong, Liu, Sifei, Cheung, Ka Chun, Yin, Hongxu, Lu, Yao, Qi, Xiaojuan, Han, Song, Chen, Yukang

arXiv.org Artificial Intelligence 

We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While RL is essential for LLMs' reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. Beyond efficiency, our findings show that quantization noise increases policy entropy, enhancing exploration and enabling the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts noise during training. Experiments demonstrate that QeRL delivers over a 1.5x speedup in the rollout phase. Moreover, this is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while delivering overall speedups for RL training. It also achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while matching the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) with the 7B model. These results establish QeRL as an efficient and effective framework for RL training in LLMs.

Figure 1: Rollout speedup and accuracy of QeRL on Qwen2.5-7B-Instruct. QeRL achieves faster RL rollout and end-to-end training speeds (batch=8), while delivering performance superior to vanilla LoRA and QLoRA and comparable to full-parameter RL on mathematical benchmarks.

The ability to perform multi-step reasoning is critical for large language models (LLMs) to handle complex tasks, from theoretical problem solving to practical decision making (Sui et al., 2025; Xu et al., 2025; Chu et al., 2025; Yang et al., 2021). Supervised fine-tuning (SFT) is a common method to improve reasoning by training models to replicate explicit reasoning steps (Huang et al., 2024d; Min et al., 2024).
In contrast, reinforcement learning (RL) uses verifiable reward signals to support adaptive learning, allowing models to explore diverse reasoning traces and identify more robust solutions (Lambert et al., 2024; DeepSeek-AI, 2025; Chen et al., 2025a).

[1] AQN dynamically adjusts quantization noise with an exponential scheduler, enhancing exploration.
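To make the AQN footnote concrete, the exponential schedule could look like the following minimal sketch. The paper only states that AQN adjusts quantization noise with an exponential scheduler; the function name, parameter names, and default values here are illustrative assumptions, not the authors' implementation.

```python
import math

def aqn_noise_scale(step, total_steps, sigma_init=1e-2, sigma_final=1e-4):
    """Hypothetical AQN schedule: exponentially decay the noise scale
    from sigma_init to sigma_final over the course of training.

    Early in training the larger noise raises policy entropy (more
    exploration); late in training the noise shrinks so the policy
    can converge.
    """
    frac = step / max(total_steps, 1)          # training progress in [0, 1]
    decay = (sigma_final / sigma_init) ** frac  # exponential interpolation
    return sigma_init * decay

# Example: noise starts at 1e-2 and ends at 1e-4 over 100 steps.
print(aqn_noise_scale(0, 100))    # 0.01
print(aqn_noise_scale(100, 100))  # ~0.0001
```

In practice the resulting scale would multiply a noise term injected alongside the quantized weights during rollouts; the exact injection site is not specified in this excerpt.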
