Preference Tuning For Toxicity Mitigation Generalizes Across Languages
Li, Xiaochen; Yong, Zheng-Xin; Bach, Stephen H.
–arXiv.org Artificial Intelligence
We investigate the mechanisms enabling cross-lingual generalization of safety preference tuning. Recent work (Lee et al., 2024) shows that models trained via DPO do not lose the ability to generate toxic content; instead, they learn to suppress the neuron activations that lead to toxicity, focusing on the role of key and value vectors in Multi-Layer Perceptrons (MLP). While these findings explain DPO's effectiveness in the training language, they do not address its cross-lingual generalization. To bridge this gap, we extend the analysis to a multilingual context, and we demonstrate that both key …

While significant resources have been allocated to enhance the safety of large language models (LLMs) for deployment, the safety of multilingual LLMs remains underexplored (Yong et al., 2023a; Deng et al., 2024). Recent work has shown that multilingual LLMs have significant toxicity levels and therefore highlights the need for multilingual toxicity mitigation (Jain et al., 2024). However, to reduce toxicity in open-ended generations in a non-English language X, current solutions (Pozzobon et al., 2024; Liu et al., 2021; Pozzobon et al., 2023; Dementieva et al., 2024) are resource-intensive as …
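For reference, the DPO objective the abstract refers to can be stated compactly. The sketch below is a standard formulation of the loss (Rafailov et al., 2023), not the authors' training code; the tensor names are illustrative.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss on a batch of preference pairs.

    Each argument is a 1-D tensor holding the summed log-probability of the
    chosen (non-toxic) or rejected (toxic) completion under the trainable
    policy or the frozen reference model.
    """
    # Implicit rewards: scaled log-ratios of policy to reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss pushes the chosen completion above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The suppression mechanism itself can be inspected by reading MLP activations directly. Below is a minimal probe, assuming a GPT-2-style model from Hugging Face transformers; the layer and neuron indices are hypothetical placeholders, not values identified in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LAYER, NEURON = 19, 2994  # hypothetical indices, for illustration only

tok = AutoTokenizer.from_pretrained("gpt2-medium")
model = AutoModelForCausalLM.from_pretrained("gpt2-medium").eval()

captured = {}
def hook(module, inputs, output):
    # c_fc produces the pre-activation "key" responses, shape (batch, seq, 4*d).
    captured["keys"] = output.detach()

handle = model.transformer.h[LAYER].mlp.c_fc.register_forward_hook(hook)
with torch.no_grad():
    model(**tok("The weather today is", return_tensors="pt"))
handle.remove()

# How strongly this neuron's value vector would be written into the residual
# stream at the last token; DPO is reported to lower such activations for
# toxicity-promoting neurons rather than erase the vectors (Lee et al., 2024).
print(captured["keys"][0, -1, NEURON].item())
```

Comparing this activation before and after DPO on toxic prompts is the kind of measurement the paper extends from the training language to a multilingual setting.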
Jun-23-2024
- Country:
- Asia > Middle East
- UAE (0.14)
- North America > Canada (0.14)
- Genre:
- Research Report > New Finding (0.68)
- Industry:
- Energy (0.46)
- Government (0.46)