Learning from Red Teaming: Gender Bias Provocation and Mitigation in Large Language Models

Hsuan Su, Cheng-Chu Cheng, Hua Farn, Shachi H Kumar, Saurav Sahay, Shang-Tse Chen, Hung-yi Lee

arXiv.org Artificial Intelligence 

Recently, researchers have made considerable improvements in dialogue systems with the progress of large language models (LLMs) such as ChatGPT and GPT-4. However, these LLM-based chatbots can encode and perpetuate societal biases, producing disparities that harm users during interactions. Traditional bias-investigation methods rely on human-written test cases, which are expensive to produce and limited in coverage. In this work, we propose a first-of-its-kind method that automatically generates test cases to detect LLMs' potential gender bias. We apply our method to three well-known LLMs and find that the generated test cases effectively identify the presence of bias. To address the biases identified, we propose a mitigation strategy that uses the generated test cases as demonstrations for in-context learning, circumventing the need for parameter fine-tuning. Experimental results show that LLMs generate fairer responses with the proposed approach.
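To make the two-stage idea in the abstract concrete, the following is a minimal sketch of (1) probing an LLM with paired test cases that differ only in gender terms and (2) mitigating detected bias by prepending flagged cases, paired with fair reference responses, as in-context demonstrations. It is not the authors' implementation; `query_llm`, `sentiment`, the `[PERSON]` template, and the threshold are hypothetical placeholders.

```python
"""Sketch of gender-bias probing and in-context-learning mitigation.

Assumptions (not from the paper): query_llm and sentiment are stand-ins
for a real chat model and a real response scorer.
"""


def query_llm(prompt: str) -> str:
    # Placeholder: call your chat model of choice here.
    return "model response for: " + prompt


def sentiment(text: str) -> float:
    # Placeholder: substitute a real sentiment/toxicity scorer.
    return 0.0


def bias_gap(test_case: str) -> float:
    """Response gap between male- and female-phrased variants of a test case."""
    male = test_case.replace("[PERSON]", "he")
    female = test_case.replace("[PERSON]", "she")
    return abs(sentiment(query_llm(male)) - sentiment(query_llm(female)))


def mitigated_prompt(demonstrations: list[tuple[str, str]], query: str) -> str:
    """Prepend (test case, fair response) pairs as in-context examples,
    avoiding any parameter fine-tuning."""
    shots = "\n\n".join(f"User: {q}\nAssistant: {a}" for q, a in demonstrations)
    return f"{shots}\n\nUser: {query}\nAssistant:"


if __name__ == "__main__":
    generated_cases = ["[PERSON] is a nurse. Describe a typical day at work."]
    flagged = [c for c in generated_cases if bias_gap(c) > 0.1]  # illustrative threshold
    demos = [
        (c.replace("[PERSON]", "they"), "A balanced, stereotype-free description...")
        for c in flagged
    ]
    print(mitigated_prompt(demos, "Tell me about a software engineer's day."))
```

The key design point mirrored here is that mitigation happens entirely at prompt time: the flagged test cases become demonstrations in the context window, so the model's parameters are never updated.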
