Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs
Cheng, Ruoxi, Ma, Haoxuan, Cao, Shuirong, Li, Jiaqi, Pei, Aihua, Wang, Zhiqiang, Ji, Pengliang, Wang, Haoyu, Huo, Jiaqi
Bias in LLMs can harm user experience and societal outcomes. However, current bias mitigation methods often require intensive human feedback, lack transferability to other topics, or yield overconfident and random outputs. We find that involving LLMs in role-playing scenarios boosts their ability to recognize and mitigate biases. Based on this, we propose Reinforcement Learning from Multi-role Debates as Feedback (RLDF), a novel approach for bias mitigation that replaces human feedback in traditional RLHF. We utilize LLMs in multi-role debates to create a dataset that includes both high-bias and low-bias instances for training the reward model in reinforcement learning. Our approach comprises two modes: (1) self-reflection, where the same LLM participates in multi-role debates, and (2) teacher-student, where a more …

Figure 1: When asking GPT-3.5-turbo and GPT-2 about the bias in the text it generates using the prompts "Here is our Q&A", "Here is the Q&A between me and a language model", and "Here is the Q&A between me and a language model competing with you", the number of identified biases increases gradually. When informed that the content was generated by itself, the LLM admits to far fewer biased responses than with other prompts.
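The core mechanism the abstract describes, training a reward model on high-bias versus low-bias instances drawn from multi-role debates, follows the standard pairwise-preference recipe used in RLHF. The sketch below illustrates only that step under stated assumptions: the encoder, model class, example pairs, and all names (`BiasRewardModel`, `debate_pairs`, `encode`) are hypothetical placeholders, not the authors' implementation or data.

```python
# Hypothetical sketch (not the authors' code): fitting a scalar reward model on
# (low-bias, high-bias) response pairs, as RLDF's debate-derived dataset suggests.
import torch
import torch.nn as nn

# Toy debate-derived pairs: (low_bias_response, high_bias_response). Illustrative only.
debate_pairs = [
    ("Both groups perform similarly on this task.",
     "Group A is obviously worse at this task."),
    ("Aptitude varies by individual, not by gender.",
     "One gender is naturally better at math than the other."),
]

def encode(text, dim=64):
    """Hash-based bag-of-words embedding: a stand-in for a real LLM encoder."""
    vec = torch.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

class BiasRewardModel(nn.Module):
    """Maps an encoded response to a scalar reward (higher = less biased)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = BiasRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    low = torch.stack([encode(lo) for lo, _ in debate_pairs])
    high = torch.stack([encode(hi) for _, hi in debate_pairs])
    # Bradley-Terry style pairwise loss: the low-bias response should score higher.
    loss = -torch.nn.functional.logsigmoid(model(low) - model(high)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("reward(low-bias) :", model(encode("Both groups perform similarly.")).item())
print("reward(high-bias):", model(encode("Group A is obviously worse.")).item())
```

The trained scorer would then serve as the reward signal in a standard RL fine-tuning loop (e.g., PPO), in place of a reward model trained on human preference labels.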
arXiv.org Artificial Intelligence
Jun-18-2024
- Country:
- Asia (0.68)
- North America > United States > California > Alameda County > Berkeley (0.14)
- Genre:
- Research Report > Promising Solution (0.48)
- Industry:
- Health & Medicine > Consumer Health (1.00)
- Law (0.68)