ShiQ: Bringing back Bellman to LLMs

Jun-13-2026, 10:30:40 GMT–Neural Information Processing Systems

The fine-tuning of pre-trained large language models (LLMs) using reinforcement learning (RL) is generally formulated as direct policy optimization. This approach was naturally favored as it efficiently improves a pretrained LLM with simple gradient updates. Another RL paradigm, Q-learning methods, has received far less attention in the LLM community while demonstrating major success in various non-LLM RL tasks. In particular, Q-learning effectiveness stems from its sample efficiency and ability to learn offline, which is particularly valuable given the high computational cost of sampling with LLM. However, naively applying a Q-learning-style update to the model's logits is ineffective due to the specificity of LLMs.

large language model, machine learning, reinforcement learning, (9 more...)

Neural Information Processing Systems

Jun-13-2026, 10:30:40 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Reinforcement Learning (1.00)