Robust Reinforcement Learning from Corrupted Human Feedback
Neural Information Processing Systems
Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data. However, for various reasons, e.g., personal bias, context ambiguity, or lack of training, human annotators may give incorrect or inconsistent preference labels.
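As background, RLHF typically fits a reward model to pairwise preference labels using the Bradley-Terry model. A minimal sketch of that likelihood, together with one simple way to represent corrupted labels (an independent flip with probability \epsilon, an illustrative assumption rather than the paper's specific corruption model), is:

% Bradley-Terry preference likelihood under a learned reward model r_\theta:
\[
P(y_1 \succ y_2 \mid x) = \sigma\big(r_\theta(x, y_1) - r_\theta(x, y_2)\big),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}.
\]
% Illustrative corruption model: each observed label is flipped independently
% with probability \epsilon, giving the observed preference distribution
\[
\tilde{P}(y_1 \succ y_2 \mid x)
= (1 - \epsilon)\, P(y_1 \succ y_2 \mid x)
+ \epsilon\, \big(1 - P(y_1 \succ y_2 \mid x)\big).
\]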