GRADIEND: Monosemantic Feature Learning within Neural Networks Applied to Gender Debiasing of Transformer Models

Feb-3-2025–arXiv.org Artificial Intelligence

AI systems frequently exhibit and amplify social biases, including gender bias, leading to harmful consequences in critical areas. This study introduces a novel encoder-decoder approach that leverages model gradients to learn a single monosemantic feature neuron encoding gender information. We show that our method can be used to debias transformer-based language models, ...

We hypothesize that these gradients contain valuable information for identifying and modifying gender-specific features. Our method aims to learn a feature neuron that encodes gender information from the input, i.e., model gradients. Unlike existing approaches for extracting monosemantic features (e.g., Bricken et al. (2023)), our approach enables the learning of a feature neuron with a desired, interpretable meaning, such as gender.
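
As a rough illustration of compressing gradient-like vectors into one feature neuron, the following Python sketch trains a one-neuron encoder-decoder on synthetic data. Everything in it is an assumption made for illustration, not the paper's implementation: the toy vectors stand in for model gradients, a hidden sign stands in for gender, the decoder is simply trained to reconstruct its input, and the names (`make_toy_gradients`, `train`, `encode`, `decode`, `mse`) are hypothetical.

```python
import math
import random

# Hypothetical toy setup: each "gradient" is a fixed direction, flipped by a
# hidden binary factor (a stand-in for gender), plus small Gaussian noise.
DIRECTION = [1.0, -1.0, 0.5, 0.0, 0.0, -0.5]


def make_toy_gradients(n_per_group, rng, noise=0.1):
    """Return (vector, sign) pairs; sign is +1 or -1 for the hidden factor."""
    data = []
    for sign in (+1, -1):
        for _ in range(n_per_group):
            vec = [sign * d + rng.gauss(0.0, noise) for d in DIRECTION]
            data.append((vec, sign))
    return data


def encode(params, vec):
    """Encoder: project the vector onto one neuron and squash with tanh."""
    w_enc, b_enc, _, _ = params
    return math.tanh(sum(w * x for w, x in zip(w_enc, vec)) + b_enc)


def decode(params, h):
    """Decoder: expand the scalar feature back to the input dimension."""
    _, _, w_dec, b_dec = params
    return [h * w + b for w, b in zip(w_dec, b_dec)]


def mse(params, data):
    """Mean squared reconstruction error over the data set."""
    total = 0.0
    for vec, _ in data:
        recon = decode(params, encode(params, vec))
        total += sum((r - x) ** 2 for r, x in zip(recon, vec)) / len(vec)
    return total / len(data)


def train(data, rng, epochs=100, lr=0.05):
    """Fit the one-neuron autoencoder with plain per-sample SGD."""
    data = list(data)  # shuffle a copy so the caller's order is untouched
    dim = len(data[0][0])
    w_enc = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    w_dec = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    b_enc, b_dec = 0.0, [0.0] * dim
    for _ in range(epochs):
        rng.shuffle(data)
        for vec, _ in data:
            h = math.tanh(sum(w * x for w, x in zip(w_enc, vec)) + b_enc)
            recon = [h * w + b for w, b in zip(w_dec, b_dec)]
            # Gradient of the per-sample MSE w.r.t. each reconstructed entry.
            d_recon = [2.0 * (r - x) / dim for r, x in zip(recon, vec)]
            # Backpropagate through the decoder, then through tanh.
            d_h = sum(g * w for g, w in zip(d_recon, w_dec))
            d_pre = d_h * (1.0 - h * h)
            w_dec = [w - lr * g * h for w, g in zip(w_dec, d_recon)]
            b_dec = [b - lr * g for b, g in zip(b_dec, d_recon)]
            w_enc = [w - lr * d_pre * x for w, x in zip(w_enc, vec)]
            b_enc -= lr * d_pre
    return (w_enc, b_enc, w_dec, b_dec)


rng = random.Random(0)
toy = make_toy_gradients(20, rng)
params = train(toy, rng)

# Activation of the single feature neuron, grouped by the hidden factor.
pos = [encode(params, vec) for vec, sign in toy if sign > 0]
neg = [encode(params, vec) for vec, sign in toy if sign < 0]
print("group +1 activations:", min(pos), "to", max(pos))
print("group -1 activations:", min(neg), "to", max(neg))
```

On this toy data the neuron's activation is expected to take opposite signs for the two groups, which is the sense in which a single neuron can carry one interpretable feature. Which group ends up positive depends on the random initialization.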

machine learning, monosemantic feature learning, natural language

Country:
- Oceania > Australia (0.04)
- North America
  - United States > Minnesota
    - Hennepin County > Minneapolis (0.14)
  - Canada
    - Quebec > Montreal (0.04)
    - Alberta > Census Division No. 15
      - Improvement District No. 9 > Banff (0.04)
- Europe
  - Germany (0.04)
  - United Kingdom > England
    - Cambridgeshire > Cambridge (0.04)
  - Italy > Tuscany
    - Florence (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
  - Belgium > Brussels-Capital Region
    - Brussels (0.04)

Genre:
- Research Report > New Finding (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.66)
