GloSS over Toxicity: Understanding and Mitigating Toxicity in LLMs via Global Toxic Subspace
Zenghao Duan, Zhiyi Yin, Zhichao Shi, Liang Pang, Shaoling Jing, Jiayi Wu, Yu Yan, Huawei Shen, Xueqi Cheng
arXiv.org Artificial Intelligence
This paper investigates the underlying mechanisms of toxicity generation in Large Language Models (LLMs) and proposes an effective detoxification approach. Prior work typically treats the Feed-Forward Network (FFN) as the main source of toxicity, representing toxic regions as a set of toxic vectors or layer-wise subspaces. However, our in-depth analysis reveals that a global toxic subspace offers a more effective and comprehensive representation of toxic regions within the model. Building on this insight, we propose GloSS (Global Toxic Subspace Suppression), a lightweight, four-stage method that mitigates toxicity by identifying and removing the global toxic subspace from the FFN parameters. Experiments across a range of LLMs show that GloSS achieves state-of-the-art detoxification performance while preserving the models' general capabilities, without requiring large-scale data or model retraining.
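The abstract describes removing a low-rank toxic subspace from FFN weight matrices. A minimal sketch of the general idea, assuming the subspace is represented by an orthonormal basis obtained from a set of toxic direction vectors and removed by orthogonal projection (the paper's exact four-stage procedure is not reproduced here; all names below are hypothetical):

```python
import numpy as np

def remove_subspace(W, toxic_vectors, rank=2):
    """Project the rows of a weight matrix W off a low-rank subspace
    spanned by the top singular directions of the given toxic vectors.
    Illustrative only, not the paper's exact GloSS procedure."""
    # Stack toxic vectors (n_vectors, d); their top-`rank` right
    # singular vectors give an orthonormal basis of the subspace.
    _, _, Vt = np.linalg.svd(np.asarray(toxic_vectors), full_matrices=False)
    B = Vt[:rank]                      # (rank, d), rows orthonormal
    # Orthogonal projection: W_clean = W - (W B^T) B
    return W - (W @ B.T) @ B

# Toy usage: a random stand-in for an FFN matrix, two toxic directions kept.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
toxic = rng.standard_normal((4, 16))
W_clean = remove_subspace(W, toxic, rank=2)
# Rows of W_clean now have zero component along the removed subspace.
```

After projection, any input direction lying in the removed subspace contributes nothing to the layer's output, which is the intuition behind suppressing a toxic subspace at the parameter level.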
May-26-2025