ShiQ: Bringing back Bellman to LLMs
Neural Information Processing Systems
Fine-tuning pre-trained large language models (LLMs) with reinforcement learning (RL) is generally formulated as direct policy optimization. This approach has been naturally favored because it efficiently improves a pre-trained LLM with simple gradient updates. Another RL paradigm, Q-learning, has received far less attention in the LLM community despite major successes on various non-LLM RL tasks. In particular, the effectiveness of Q-learning stems from its sample efficiency and its ability to learn offline, which is particularly valuable given the high computational cost of sampling from LLMs. However, naively applying a Q-learning-style update to the model's logits is ineffective because of characteristics specific to LLMs.
Jun-13-2026, 10:30:40 GMT