Intrinsic Meets Extrinsic Fairness: Assessing the Downstream Impact of Bias Mitigation in Large Language Models
Mina Arzaghi, Alireza Dehghanpour Farashah, Florian Carichon, Golnoosh Farnadi
–arXiv.org Artificial Intelligence
While prior studies have questioned whether intrinsic bias in LLMs affects fairness at the downstream task level, this work empirically investigates that connection. We present a unified evaluation framework to compare intrinsic bias mitigation via concept unlearning with extrinsic bias mitigation via counterfactual data augmentation (CDA). We examine this relationship through real-world financial classification tasks, including salary prediction, employment status, and creditworthiness assessment. Using three open-source LLMs, we evaluate models both as frozen embedding extractors and as fine-tuned classifiers. Our results show that intrinsic bias mitigation through unlearning reduces intrinsic gender bias by up to 94.9%, while also improving downstream fairness metrics such as demographic parity by up to 82%, without compromising accuracy. Our framework offers practical guidance on where mitigation efforts can be most effective and highlights the importance of applying early-stage mitigation before downstream deployment.
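The demographic-parity metric reported in the abstract measures the gap in positive-prediction rates between demographic groups. A minimal sketch of how such a gap is typically computed (the function name, toy predictions, and group labels here are illustrative, not from the paper):

```python
import numpy as np

def demographic_parity_difference(y_pred, group):
    """Absolute gap in positive-prediction rate between two groups:
    |P(y_hat = 1 | group = 0) - P(y_hat = 1 | group = 1)|."""
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    rate_0 = y_pred[group == 0].mean()  # positive rate for group 0
    rate_1 = y_pred[group == 1].mean()  # positive rate for group 1
    return abs(rate_0 - rate_1)

# Toy example: binary classifier outputs for two gender groups.
preds  = [1, 0, 1, 1, 0, 0, 1, 0]
groups = [0, 0, 0, 0, 1, 1, 1, 1]
print(demographic_parity_difference(preds, groups))  # 0.5
```

A mitigation that "improves demographic parity by up to 82%", as the abstract claims, would correspondingly shrink this gap toward zero while leaving accuracy largely unchanged.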
Sep-23-2025