Investigating Gender Bias in Language Models Using Causal Mediation Analysis

Oct-10-2024, 19:06:30 GMT–Neural Information Processing Systems

Many interpretation methods for neural models in natural language processing investigate how information is encoded inside hidden representations. However, these methods can only measure whether the information exists, not whether it is actually used by the model. We propose a methodology grounded in the theory of causal mediation analysis for interpreting which parts of a model are causally implicated in its behavior. The approach enables us to analyze the mechanisms that facilitate the flow of information from input to output through various model components, known as mediators. As a case study, we apply this methodology to analyzing gender bias in pre-trained Transformer language models.
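
As a rough illustration of the idea, and not the paper's implementation, the sketch below applies mediation-style interventions to a hypothetical toy model with one scalar input, two hidden neurons acting as mediators, and a positive output score. The weights, the function names (`hidden`, `output`, `run`, `effects`), and the relative-change effect measure are all assumptions made for this example. The total effect changes the input and lets everything respond; the indirect effect keeps the input fixed and sets only the chosen neuron to the value it would take under the changed input; the direct effect changes the input while holding that neuron at its original value.

```python
import math

# Hypothetical toy model (assumption for illustration, not a real language model).
W_IN = [1.5, -0.5]   # input -> hidden neuron weights
W_OUT = [2.0, 1.0]   # hidden neuron -> output weights
W_SKIP = 0.3         # path from input to output that bypasses the neurons


def hidden(x):
    """Activations of the two hidden neurons (the candidate mediators)."""
    return [math.tanh(w * x) for w in W_IN]


def output(x, h):
    """Positive output score computed from the input and hidden activations."""
    return math.exp(sum(v * hi for v, hi in zip(W_OUT, h)) + W_SKIP * x)


def run(x, patch=None):
    """Run the model on x, optionally forcing some neurons to given values."""
    h = hidden(x)
    for index, value in (patch or {}).items():
        h[index] = value
    return output(x, h)


def effects(x_base, x_alt, neuron):
    """Relative change in the output under three interventions."""
    y_base = run(x_base)
    # Total effect: change the input and let every component respond.
    total = run(x_alt) / y_base - 1
    # Indirect effect: keep the input, set the neuron to its value under x_alt.
    indirect = run(x_base, {neuron: hidden(x_alt)[neuron]}) / y_base - 1
    # Direct effect: change the input, hold the neuron at its baseline value.
    direct = run(x_alt, {neuron: hidden(x_base)[neuron]}) / y_base - 1
    return total, direct, indirect


total, direct, indirect = effects(x_base=0.0, x_alt=1.0, neuron=0)
print(f"total={total:.3f} direct={direct:.3f} indirect={indirect:.3f}")
```

A neuron with a large indirect effect relative to the total effect is one through which the input change reaches the output, which is the sense in which a component is "causally implicated" rather than merely encoding information.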

causal mediation analysis, gender bias, language model

Industry:
- Law > Alternative Dispute Resolution (0.71)

Technology:
- Information Technology > Artificial Intelligence > Natural Language (1.00)