Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs
Cheng, Ruoxi, Ma, Haoxuan, Cao, Shuirong, Li, Jiaqi, Pei, Aihua, Wang, Zhiqiang, Ji, Pengliang, Wang, Haoyu, Huo, Jiaqi
Bias in LLMs can harm user experience and societal outcomes. However, current bias mitigation methods often require intensive human feedback, lack transferability to other topics, or yield overconfident and random outputs. We find that involving LLMs in role-playing scenarios boosts their ability to recognize and mitigate biases. Based on this, we propose Reinforcement Learning from Multi-role Debates as Feedback (RLDF), a novel approach for bias mitigation that replaces human feedback in traditional RLHF. We utilize LLMs in multi-role debates to create a dataset that includes both high-bias and low-bias instances for training the reward model in reinforcement learning. Our approach comprises two modes: (1) self-reflection, where the same LLM participates in multi-role debates, and (2) teacher-student, where a more …

Figure 1: When asking GPT-3.5-turbo and GPT-2 about the bias in the text it generates using the prompts "Here is our Q&A", "Here is the Q&A between me and a language model", and "Here is the Q&A between me and a language model competing with you", the number of identified biases increases gradually. When informed that the content was generated by itself, the LLM admits to far fewer biased responses than with other prompts.
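The core mechanism the abstract describes, training a reward model on high-bias versus low-bias instances drawn from multi-role debates, follows the standard pairwise-preference recipe used in RLHF. The sketch below illustrates only that step under stated assumptions: the encoder, model class, example pairs, and all names (`BiasRewardModel`, `debate_pairs`, `encode`) are hypothetical placeholders, not the authors' implementation or data.

```python
# Hypothetical sketch (not the authors' code): fitting a scalar reward model on
# (low-bias, high-bias) response pairs, as RLDF's debate-derived dataset suggests.
import torch
import torch.nn as nn

# Toy debate-derived pairs: (low_bias_response, high_bias_response). Illustrative only.
debate_pairs = [
    ("Both groups perform similarly on this task.",
     "Group A is obviously worse at this task."),
    ("Aptitude varies by individual, not by gender.",
     "One gender is naturally better at math than the other."),
]

def encode(text, dim=64):
    """Hash-based bag-of-words embedding: a stand-in for a real LLM encoder."""
    vec = torch.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

class BiasRewardModel(nn.Module):
    """Maps an encoded response to a scalar reward (higher = less biased)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = BiasRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    low = torch.stack([encode(lo) for lo, _ in debate_pairs])
    high = torch.stack([encode(hi) for _, hi in debate_pairs])
    # Bradley-Terry style pairwise loss: the low-bias response should score higher.
    loss = -torch.nn.functional.logsigmoid(model(low) - model(high)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("reward(low-bias) :", model(encode("Both groups perform similarly.")).item())
print("reward(high-bias):", model(encode("Group A is obviously worse.")).item())
```

The trained scorer would then serve as the reward signal in a standard RL fine-tuning loop (e.g., PPO), in place of a reward model trained on human preference labels.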
arXiv.org Artificial Intelligence
Jun-18-2024
- Country:
- Asia (0.68)
- North America > United States > California > Alameda County > Berkeley (0.14)
- Genre:
- Research Report > Promising Solution (0.48)
- Industry:
- Health & Medicine > Consumer Health (1.00)
- Law (0.68)