LEACE: Perfect linear concept erasure in closed form
Belrose, Nora; Schneider-Joseph, David; Ravfogel, Shauli; Cotterell, Ryan; Raff, Edward; Biderman, Stella
arXiv.org Artificial Intelligence
Concept erasure aims to remove specified features from a representation. It can improve fairness (e.g. preventing a classifier from using gender or race) and interpretability (e.g. removing a concept to observe changes in model behavior). We introduce LEAst-squares Concept Erasure (LEACE), a closed-form method which provably prevents all linear classifiers from detecting a concept while changing the representation as little as possible, as measured by a broad class of norms. We apply LEACE to large language models with a novel procedure called "concept scrubbing," which erases target concept information from every layer in the network. We demonstrate our method on two tasks: measuring the reliance of language models on part-of-speech information, and reducing gender bias in BERT embeddings. Code is available at https://github.com/EleutherAI/concept-erasure.
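The abstract's central claim, that a closed-form linear edit can stop every linear classifier from detecting a concept, can be illustrated with a small NumPy sketch. It relies on an assumption not spelled out in the abstract: that a linear probe can do no better than a constant predictor once the cross-covariance between the features and the concept labels is zero. The whitening-then-projection construction below reflects our understanding of the general shape of a least-squares eraser and is not the reference implementation from the linked repository; the names `fit_linear_eraser` and `erase` are hypothetical.

```python
import numpy as np


def fit_linear_eraser(X, Z, tol=1e-10):
    """Fit an affine edit x -> x - P @ (x - mean) that zeroes the sample
    cross-covariance between the features X and the concept labels Z.

    Illustrative sketch only; not the authors' reference code.
    """
    X = np.asarray(X, dtype=np.float64)
    Z = np.asarray(Z, dtype=np.float64)
    if Z.ndim == 1:
        Z = Z[:, None]  # treat a label vector as a single concept column

    n = X.shape[0]
    mean = X.mean(axis=0)
    Xc = X - mean
    Zc = Z - Z.mean(axis=0)

    sigma_xx = Xc.T @ Xc / n  # feature covariance
    sigma_xz = Xc.T @ Zc / n  # feature/concept cross-covariance

    # Whitening matrix W = sigma_xx^(-1/2) and its pseudo-inverse, built from
    # the eigendecomposition; near-zero eigenvalues are dropped for stability.
    evals, evecs = np.linalg.eigh(sigma_xx)
    keep = evals > tol * evals.max()
    V, lam = evecs[:, keep], evals[keep]
    W = (V / np.sqrt(lam)) @ V.T
    W_pinv = (V * np.sqrt(lam)) @ V.T

    # Orthogonal projector onto the column space of the whitened
    # cross-covariance, mapped back to the original coordinates.
    A = W @ sigma_xz
    proj = A @ np.linalg.pinv(A)
    P = W_pinv @ proj @ W
    return mean, P


def erase(X, mean, P):
    """Apply the fitted edit to every row of X."""
    X = np.asarray(X, dtype=np.float64)
    return X - (X - mean) @ P.T
```

Fitting on a feature matrix and a label vector, then calling `erase(X, mean, P)`, should leave the sample cross-covariance between the edited features and the labels at zero up to floating-point error, while changing the features only within a subspace whose dimension is at most the number of concept columns.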
Oct-29-2023
- Country:
  - North America
    - Dominican Republic (0.04)
    - United States
      - New York > New York County
        - New York City (0.04)
      - Massachusetts > Middlesex County
        - Cambridge (0.04)
    - Canada > Ontario
      - Toronto (0.04)
  - Europe
    - United Kingdom > England
      - Cambridgeshire > Cambridge (0.14)
      - Oxfordshire > Oxford (0.04)
    - Switzerland > Zürich
      - Zürich (0.04)
    - France > Provence-Alpes-Côte d'Azur
      - Bouches-du-Rhône > Marseille (0.04)
    - Croatia > Dubrovnik-Neretva County
      - Dubrovnik (0.04)
    - Belgium > Brussels-Capital Region
      - Brussels (0.04)
- Genre:
  - Research Report > New Finding (0.46)
- Industry:
  - Media (0.68)
  - Health & Medicine (0.68)
  - Leisure & Entertainment (0.68)
- Technology: