Large Language Model Bias Mitigation from the Perspective of Knowledge Editing

Ruizhe Chen, Yichen Li, Zikai Xiao, Zuozhu Liu

arXiv.org Artificial Intelligence 

Existing debiasing methods inevitably make unreasonable or undesired predictions, as they are designed and evaluated to achieve parity across different social groups while leaving aside individual facts, resulting in modified existing knowledge. In this paper, we first establish a new bias mitigation benchmark, BiasKE, leveraging existing and additionally constructed datasets, which systematically assesses debiasing performance with complementary metrics on fairness, specificity, and generalization. We further propose a novel debiasing method, Fairness Stamp (FAST), which enables editable fairness through fine-grained calibration of individual biased knowledge. Comprehensive experiments demonstrate that FAST surpasses state-of-the-art baselines with remarkable debiasing performance while not hampering overall model capability or knowledge preservation, highlighting the promise of fine-grained debiasing strategies for editable fairness in LLMs.

Pre-trained Large Language Models (LLMs) have demonstrated exceptional performance on many tasks (Devlin et al., 2018; Floridi & Chiriatti, 2020; Brown et al., 2020). However, the social stereotypes and human-like biases they encode inevitably cause undesired behaviors when LLMs are deployed in practice (Zhao et al., 2019; Navigli et al., 2023; Sheng et al., 2021). Existing approaches to mitigating biases in LLMs are mainly categorized into: (1) fine-tuning (Zmigrod et al., 2019; Webster et al., 2020; He et al., 2022; Liang et al., 2020; Lauscher et al., 2021), which includes techniques such as re-balanced corpus pre-training, contrastive learning, projection methods, and efficient parameter tuning. However, existing techniques treat social groups as interchangeable (Gallegos et al., 2023) and neutralize protected attributes of different social groups in model inputs or outputs, while ignoring or overriding the individual facts that distinguish these groups. Furthermore, existing debiasing evaluation metrics mainly focus on the degree of bias but fail to measure whether the model retains its original knowledge (Gallegos et al., 2023) for discerning reasonable disparities among different social groups.
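To make the three complementary evaluation axes concrete, below is a minimal sketch of how fairness, specificity, and generalization scores could be computed from model outputs. This is an illustrative assumption, not the official BiasKE implementation: the function names, the (stereotypical, anti-stereotypical) log-likelihood pairing, the 0.5-parity convention, and the use of pre-/post-edit top-1 predictions for specificity are all hypothetical choices made for the example.

from typing import List, Tuple


def stereotype_score(pair_scores: List[Tuple[float, float]]) -> float:
    """Fairness: fraction of (stereotypical, anti-stereotypical) sentence pairs
    where the model assigns the stereotypical variant a higher log-likelihood.
    A perfectly debiased model would score close to 0.5."""
    higher = sum(1 for stereo, anti in pair_scores if stereo > anti)
    return higher / len(pair_scores)


def specificity(pre_edit_preds: List[str], post_edit_preds: List[str]) -> float:
    """Specificity: fraction of unrelated factual probes whose top prediction is
    unchanged after debiasing, i.e. how much existing knowledge is preserved."""
    kept = sum(1 for before, after in zip(pre_edit_preds, post_edit_preds) if before == after)
    return kept / len(pre_edit_preds)


def generalization(paraphrase_pair_scores: List[Tuple[float, float]]) -> float:
    """Generalization: the same parity test applied to paraphrased or held-out
    restatements of the edited biased knowledge."""
    return stereotype_score(paraphrase_pair_scores)


if __name__ == "__main__":
    # Toy numbers only: log-likelihoods for three sentence pairs and top-1
    # predictions for four factual probes before and after debiasing.
    pairs = [(-10.2, -11.0), (-9.8, -9.7), (-12.1, -12.4)]
    pre = ["Paris", "Einstein", "oxygen", "1969"]
    post = ["Paris", "Einstein", "oxygen", "1969"]
    print(f"fairness (0.5 is ideal): {stereotype_score(pairs):.2f}")
    print(f"specificity: {specificity(pre, post):.2f}")
    print(f"generalization: {generalization(pairs):.2f}")

In this toy run, the per-sentence log-likelihoods would come from whatever scorer is used to evaluate the language model; only the aggregation logic is shown here.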
