Collapsed Language Models Promote Fairness

Jingxuan Xu, Wuyang Chen, Linyi Li, Yao Zhao, Yunchao Wei

arXiv.org Artificial Intelligence 

To mitigate the societal biases implicitly encoded in recent successful pretrained language models, a diverse array of approaches has been proposed to encourage model fairness, focusing on prompting, data augmentation, regularized finetuning, and more. Despite these developments, it remains nontrivial to reach a principled understanding of fairness and an effective algorithm that can consistently debias language models. In this work, through rigorous evaluations of Neural Collapse (a learning phenomenon that arises in the last-layer representations and classifiers of deep networks) on fairness-related words, we find that debiased language models exhibit collapsed alignment between token representations and word embeddings. More importantly, this observation inspires us to design a principled fine-tuning method that effectively improves fairness across a wide range of debiasing methods, while still preserving the performance of language models on standard natural language understanding tasks.

The rise of pre-trained language models (PLMs) has revolutionized natural language processing, greatly enhancing tasks like reasoning and prediction by harnessing the semantic richness of language data. Despite their effectiveness, these models, trained on extensive corpora, often reflect and even intensify societal biases present in their training datasets. Such biases manifest in the association of demographic groups with specific roles or capabilities, affecting fairness in applications ranging from legal analytics to hiring processes [49; 12; 38; 2; 52; 3; 7]. Thus, it is crucial to address and mitigate these biases to prevent discriminatory practices in downstream applications [70; 64; 46].
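To make the abstract's notion of "alignment between token representations and word embeddings" concrete, the following is a minimal, illustrative sketch of one way such alignment could be probed: measuring the cosine similarity between a word's last-layer contextual representation and its static word-embedding vector. This is not the paper's exact Neural Collapse metric; the model choice (bert-base-uncased), the probe words, the template sentence, and the cosine-similarity proxy are all assumptions made for illustration.

```python
# Illustrative probe (assumed setup, not the paper's method): compare last-layer
# token representations with the model's static word embeddings for a few
# fairness-related words, using cosine similarity as a rough alignment proxy.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "bert-base-uncased"  # assumed model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

# Fairness-related probe words in a neutral template (assumed examples).
words = ["doctor", "nurse", "engineer", "teacher"]
sentences = [f"The {w} finished the report." for w in words]

embedding_matrix = model.get_input_embeddings().weight  # static word embeddings

with torch.no_grad():
    for word, sent in zip(words, sentences):
        enc = tokenizer(sent, return_tensors="pt")
        out = model(**enc)
        hidden = out.hidden_states[-1][0]  # last-layer states for this sentence

        # Locate the probe word's (first) subword token in the input.
        word_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word)[0])
        pos = (enc["input_ids"][0] == word_id).nonzero(as_tuple=True)[0][0]

        token_repr = hidden[pos]
        word_emb = embedding_matrix[word_id]
        cos = torch.nn.functional.cosine_similarity(token_repr, word_emb, dim=0)
        print(f"{word:>10s}: representation-embedding cosine = {cos.item():.3f}")
```

Under this sketch, a debiased model exhibiting the "collapsed alignment" described in the abstract would be expected to show systematically higher alignment scores on fairness-related words than a biased counterpart, though the paper's actual evaluation may differ in both metric and word selection.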