Robust Reinforcement Learning from Corrupted Human Feedback

Neural Information Processing Systems 

Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data. However, human annotators may give incorrect or inconsistent preference labels for various reasons, e.g., personal bias, context ambiguity, or lack of training. To tackle this challenge, we propose a robust RLHF approach -- $R^3M$ -- which models potentially corrupted preference labels as sparse outliers. Accordingly, we formulate robust reward learning as an $\ell_1$-regularized maximum likelihood estimation problem. Computationally, we develop an efficient alternating optimization algorithm, which incurs only negligible computational overhead compared with the standard RLHF approach.
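
To make the formulation concrete, below is a minimal sketch, not the paper's implementation, of $\ell_1$-regularized reward learning under a Bradley-Terry preference model with a per-pair outlier variable, trained by alternating between a gradient step on the reward parameters and a proximal (soft-thresholding) step on the outlier terms. The linear reward model, variable names, and hyperparameters are illustrative assumptions rather than details taken from the paper.

```python
# Hypothetical sketch (not the paper's code): Bradley-Terry reward learning
# with a per-pair outlier variable delta_i and an l1 penalty on delta,
# optimized by alternating reward-parameter updates and proximal delta updates.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_pairs, dim, corrupt_frac = 256, 16, 0.1

# Toy data: each pair is summarized by the feature difference
# phi(x, y_chosen) - phi(x, y_rejected), scored by a linear reward model.
true_theta = torch.randn(dim)
feat_diff = torch.randn(n_pairs, dim)
feat_diff *= torch.sign(feat_diff @ true_theta).unsqueeze(1)  # consistent labels
flip = torch.rand(n_pairs) < corrupt_frac
feat_diff[flip] *= -1.0                                       # corrupted labels

theta = torch.zeros(dim, requires_grad=True)   # reward model parameters
delta = torch.zeros(n_pairs)                   # per-pair outlier terms
lam, lr = 0.5, 0.1                             # l1 weight, step size
opt = torch.optim.Adam([theta], lr=lr)

for _ in range(300):
    # Step 1: update theta with delta fixed (negative Bradley-Terry
    # log-likelihood with an additive outlier shift on each pair).
    margin = feat_diff @ theta + delta
    loss = -F.logsigmoid(margin).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

    # Step 2: update delta with theta fixed via a proximal gradient step;
    # soft-thresholding enforces the l1 penalty and keeps delta sparse.
    with torch.no_grad():
        margin = feat_diff @ theta + delta
        grad = -torch.sigmoid(-margin)         # d/d(delta_i) of -log sigmoid
        delta = delta - lr * grad
        delta = torch.sign(delta) * torch.clamp(delta.abs() - lr * lam, min=0.0)

# Pairs with nonzero delta are the ones flagged as likely corrupted.
print("flagged:", (delta.abs() > 1e-3).float().mean().item(),
      "true corrupt rate:", flip.float().mean().item())
```

Because the data-fit term and the $\ell_1$ penalty are separable across pairs in delta, the delta update decomposes into independent one-dimensional proximal steps, which is why the extra cost over standard reward learning is negligible in this sketch.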