Taming the Judge: Deconflicting AI Feedback for Stable Reinforcement Learning
Liu, Boyin; Zhang, Zhuo; Huang, Sen; Xie, Lipeng; Fu, Qingxu; Chen, Haoran; Yu, Li; Hu, Tianyi; Liu, Zhaoyang; Ding, Bolin; Zhao, Dongbin
arXiv.org Artificial Intelligence
Aligning language models using LLM judge feedback offers a scalable alternative to human annotation, yet is plagued by judgment inconsistencies that destabilize reinforcement learning. While prior work has focused on judge accuracy, the critical issue of logical coherence, particularly preference cycles (A ≻ B ≻ C ≻ A), has been largely unaddressed. To address this gap, this work introduces an end-to-end framework to systematically detect and resolve these inconsistencies within the reinforcement learning training loop. Our framework features two core contributions: the Conflict Detection Rate (CDR), a novel metric to quantify judgment conflicts, and Deconflicted Graph Rewards (DGR), a signal-purification framework that eliminates cycles before policy optimization. DGR constructs preference graphs from raw judgments, transforms them into conflict-free Directed Acyclic Graphs (DAGs), and generates a logically coherent reward signal compatible with any policy optimizer. Experiments confirm that our framework significantly improves training stability and model performance over strong baselines, establishing logical consistency as a crucial and now-addressable dimension of AI feedback.

Aligning large language models (LLMs) with human preferences, traditionally achieved through Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022), is critical for safe AI deployment. However, the reliance on costly and slow human annotation has created a scalability bottleneck, pushing the field towards Reinforcement Learning from AI Feedback (RLAIF) (Bai et al., 2022; Lee et al., 2023). Within RLAIF, the pairwise comparison paradigm, in which an LLM judge selects the better of two responses, has become the de facto standard, prized for its intuitive nature and fine-grained feedback signal that underpins many state-of-the-art alignment techniques (Song et al., 2024; Wang et al., 2024).
Recent advances in pairwise methods include the Pairwise-RL framework (Xu et al., 2025), which addresses the fundamental misalignment between generative base models and discriminative reward tasks by unifying reward model training and reinforcement learning application in a consistent pairwise paradigm. This framework combines generative reward modeling with pairwise policy optimization, leveraging generative modeling techniques to improve reward model performance and score calibration. Consequently, our work focuses on the pairwise paradigm, building upon these foundational approaches. However, this scalability comes at a hidden cost: the erosion of logical consistency.
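To make the idea of a preference cycle and its removal concrete, the following is a minimal Python sketch, not the authors' implementation. The excerpt above does not give the exact definition of CDR or the procedure DGR uses to obtain a DAG, so every choice here is an assumption made for illustration: conflicts are counted as the fraction of response triples that form a cycle, cycles are broken by dropping edges that disagree with a net-win ordering, and the reward is the net-win count in the resulting graph. All function names are hypothetical.

```python
from itertools import combinations

def cyclic_triple_rate(items, prefers):
    """Fraction of item triples whose pairwise judgments form a cycle.

    `prefers` is a set of (winner, loser) pairs. This is an illustrative
    stand-in for a conflict metric, not the paper's definition of CDR.
    """
    triples = list(combinations(items, 3))
    if not triples:
        return 0.0

    def is_cycle(a, b, c):
        forward = (a, b) in prefers and (b, c) in prefers and (c, a) in prefers
        backward = (b, a) in prefers and (c, b) in prefers and (a, c) in prefers
        return forward or backward

    return sum(is_cycle(*t) for t in triples) / len(triples)

def net_wins(items, edges):
    """Wins minus losses for each item over a set of (winner, loser) edges."""
    score = {x: 0 for x in items}
    for winner, loser in edges:
        score[winner] += 1
        score[loser] -= 1
    return score

def deconflict(items, prefers):
    """Drop edges that contradict a net-win ordering (assumed heuristic).

    Every kept edge points from a higher-ranked item to a lower-ranked one,
    so the kept edges cannot contain a cycle.
    """
    score = net_wins(items, prefers)
    # Ties are broken by original position so the result is deterministic.
    order = sorted(items, key=lambda x: (-score[x], items.index(x)))
    rank = {x: i for i, x in enumerate(order)}
    return {(w, l) for (w, l) in prefers if rank[w] < rank[l]}

# Four candidate responses; the judge prefers A over B, B over C, but C over A.
items = ["A", "B", "C", "D"]
judgments = {("A", "B"), ("B", "C"), ("C", "A"),
             ("A", "D"), ("B", "D"), ("C", "D")}

raw_rate = cyclic_triple_rate(items, judgments)   # 0.25: one of four triples cycles
dag = deconflict(items, judgments)                # the edge ("C", "A") is dropped
clean_rate = cyclic_triple_rate(items, dag)       # 0.0
rewards = net_wins(items, dag)                    # {"A": 2, "B": 1, "C": 0, "D": -3}
```

In this toy case the three cyclic responses tie on raw net wins, so the raw judgments give no way to rank them; after the conflicting edge is removed, the rewards order the responses consistently. Which edge is dropped depends on the tie-breaking rule, which is an arbitrary choice in this sketch.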
Oct-22-2025