Fusing Rewards and Preferences in Reinforcement Learning

Khorasani, Sadegh, Salehkaleybar, Saber, Kiyavash, Negar, Grossglauser, Matthias

Aug-18-2025–arXiv.org Artificial Intelligence

We present Dual-Feedback Actor (DFA), a reinforcement learning algorithm that fuses both individual rewards and pairwise preferences (if available) into a single update rule. DFA uses the policy's log-probabilities directly to model the preference probability, avoiding a separate reward-modeling step. Preferences can be provided by human-annotators (at state-level or trajectory-level) or be synthesized online from Q-values stored in an off-policy replay buffer. Under a Bradley-Terry model, we prove that minimizing DFA's preference loss recovers the entropy-regularized Soft Actor-Critic (SAC) policy. Our simulation results show that DFA trained on generated preferences matches or exceeds SAC on six control environments and demonstrates a more stable training process. With only a semi-synthetic preference dataset under Bradley-Terry model, our algorithm outperforms reward-modeling reinforcement learning from human feedback (RLHF) baselines in a stochastic GridWorld and approaches the performance of an oracle with true rewards.

arxiv preprint arxiv, machine learning, reinforcement learning, (16 more...)

arXiv.org Artificial Intelligence

Aug-18-2025

arXiv.org PDF

Add feedback

Country:
- Europe (0.93)

Genre:
- Research Report > New Finding (0.87)

Industry:
- Information Technology (0.93)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Uncertainty
    - Bayesian Inference (0.46)
  - Machine Learning
    - Reinforcement Learning (1.00)
    - Neural Networks > Deep Learning (0.70)
    - Learning Graphical Models > Directed Networks
      - Bayesian Learning (0.46)