DRLC: Reinforcement Learning with Dense Rewards from LLM Critic

Cao, Meng, Shu, Lei, Yu, Lei, Zhu, Yun, Wichers, Nevan, Liu, Yinxiao, Meng, Lei

Jan-14-2024–arXiv.org Artificial Intelligence

Reinforcement learning (RL) can align language models with non-differentiable reward signals, such as human preferences. A major challenge, however, is the sparsity of these signals: typically, there is only one reward for the entire generation. This sparsity can make learning inefficient and unstable. In this paper, we introduce a novel framework that leverages the critique ability of LLMs to produce dense rewards throughout the learning process. Our approach pairs a critic language model with the policy model. The critic is prompted with the task description, the question, the policy model's output, and the environment's reward signal, and returns token- or span-level dense rewards that reflect the quality of each segment of the output. We assess our approach on three text generation tasks: sentiment control, language model detoxification, and summarization. Experimental results show that incorporating these artificial dense rewards in training yields consistent performance gains over the PPO baseline with holistic rewards. Furthermore, in a setting where the same model serves as both policy and critic, we demonstrate that "self-critique" rewards also boost learning efficiency.
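To make the reward flow described in the abstract concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the prompt template, the function names `build_critic_prompt` and `dense_rewards`, the `(start, end, score)` span format, and the choice to add the holistic reward at the final token are all hypothetical.

```python
from typing import List, Tuple


def build_critic_prompt(task: str, question: str, output: str, reward: float) -> str:
    """Combine the four critic inputs named in the abstract (hypothetical template)."""
    return (
        f"Task: {task}\n"
        f"Question: {question}\n"
        f"Model output: {output}\n"
        f"Environment reward: {reward:.2f}\n"
        "List the spans that helped or hurt the reward."
    )


def dense_rewards(
    tokens: List[str],
    critic_spans: List[Tuple[int, int, float]],
    env_reward: float,
) -> List[float]:
    """Turn span-level critic scores into one reward per token.

    Assumption: each span is (start, end, score) with an exclusive end index,
    and the holistic environment reward is added at the last token.
    """
    rewards = [0.0] * len(tokens)
    for start, end, score in critic_spans:
        # Clamp the span to the sequence so malformed critic output cannot crash.
        for i in range(max(start, 0), min(end, len(tokens))):
            rewards[i] += score
    if rewards:
        rewards[-1] += env_reward
    return rewards


tokens = ["The", "movie", "was", "absolutely", "wonderful", "."]
spans = [(3, 5, 0.5)]  # hypothetical critic judgment: "absolutely wonderful" helped
print(dense_rewards(tokens, spans, env_reward=1.0))
# [0.0, 0.0, 0.0, 0.5, 0.5, 1.0]
```

With only the holistic reward, every token except the last would receive zero; the critic's span scores give the policy a learning signal at intermediate positions as well.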

computational linguistic, language model, reward signal

Country:
- North America
  - Dominican Republic (0.04)
  - United States
    - Oregon (0.04)
    - New York (0.04)
    - Texas > Travis County
      - Austin (0.04)
    - California > Los Angeles County
      - Los Angeles (0.04)
  - Puerto Rico > San Juan
    - San Juan (0.04)
  - Canada
    - Ontario > Toronto (0.14)
    - Quebec > Montreal (0.04)
    - British Columbia > Metro Vancouver Regional District
      - Vancouver (0.14)
- Europe
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - Germany > Saarland
    - Saarbrücken (0.04)
  - Denmark > Capital Region
    - Copenhagen (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)
- Asia
  - South Korea (0.04)
  - Middle East
    - Jordan (0.04)
    - UAE > Abu Dhabi Emirate
      - Abu Dhabi (0.04)

Genre:
- Research Report > New Finding (0.48)

Industry:
- Leisure & Entertainment (0.67)
- Media > Film (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning
    - Reinforcement Learning (1.00)
    - Neural Networks > Deep Learning (1.00)