Preference Tuning For Toxicity Mitigation Generalizes Across Languages
Li, Xiaochen; Yong, Zheng-Xin; Bach, Stephen H.
–arXiv.org Artificial Intelligence
We investigate the mechanisms enabling cross-lingual generalization of safety preference tuning. Recent work (Lee et al., 2024) shows that models trained via DPO do not lose the ability to generate toxic content; instead, they learn to suppress the neuron activations that lead to toxicity, focusing on the role of key and value vectors in Multi-Layer Perceptrons (MLP). While these findings explain DPO's effectiveness in the training language, they do not address its cross-lingual generalization. To bridge this gap, we extend the analysis to a multilingual context, and we demonstrate that both key …

While significant resources have been allocated to enhance the safety of large language models (LLMs) for deployment, the safety of multilingual LLMs remains underexplored (Yong et al., 2023a; Deng et al., 2024). Recent work has shown that multilingual LLMs have significant toxicity levels and therefore highlights the need for multilingual toxicity mitigation (Jain et al., 2024). However, to reduce toxicity in open-ended generations in a non-English language X, current solutions (Pozzobon et al., 2024; Liu et al., 2021; Pozzobon et al., 2023; Dementieva et al., 2024) are resource-intensive as …
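For reference, the DPO objective the abstract refers to can be stated compactly. The sketch below is a standard formulation of the loss (Rafailov et al., 2023), not the authors' training code; the tensor names are illustrative.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss on a batch of preference pairs.

    Each argument is a 1-D tensor holding the summed log-probability of the
    chosen (non-toxic) or rejected (toxic) completion under the trainable
    policy or the frozen reference model.
    """
    # Implicit rewards: scaled log-ratios of policy to reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss pushes the chosen completion above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The suppression mechanism itself can be inspected by reading MLP activations directly. Below is a minimal probe, assuming a GPT-2-style model from Hugging Face transformers; the layer and neuron indices are hypothetical placeholders, not values identified in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LAYER, NEURON = 19, 2994  # hypothetical indices, for illustration only

tok = AutoTokenizer.from_pretrained("gpt2-medium")
model = AutoModelForCausalLM.from_pretrained("gpt2-medium").eval()

captured = {}
def hook(module, inputs, output):
    # c_fc produces the pre-activation "key" responses, shape (batch, seq, 4*d).
    captured["keys"] = output.detach()

handle = model.transformer.h[LAYER].mlp.c_fc.register_forward_hook(hook)
with torch.no_grad():
    model(**tok("The weather today is", return_tensors="pt"))
handle.remove()

# How strongly this neuron's value vector would be written into the residual
# stream at the last token; DPO is reported to lower such activations for
# toxicity-promoting neurons rather than erase the vectors (Lee et al., 2024).
print(captured["keys"][0, -1, NEURON].item())
```

Comparing this activation before and after DPO on toxic prompts is the kind of measurement the paper extends from the training language to a multilingual setting.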
Jun-23-2024
- Country:
- Asia > Middle East
- UAE (0.14)
- North America > Canada (0.14)
- Genre:
- Research Report > New Finding (0.68)
- Industry:
- Energy (0.46)
- Government (0.46)