Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment

Banerjee, Somnath, Layek, Sayan, Chatterjee, Pratyush, Mukherjee, Animesh, Hazra, Rima

Feb-16-2025, arXiv.org Artificial Intelligence

Ensuring consistent safety across multiple languages remains a significant challenge for large language models (LLMs). We introduce Soteria, a lightweight yet powerful strategy that locates and minimally adjusts the "functional heads" most responsible for harmful content generation in each language. By altering only a fraction of parameters, Soteria drastically reduces policy violations without sacrificing overall model performance, even in low-resource settings. To rigorously evaluate our approach, we also present XThreatBench, a specialized multilingual dataset capturing fine-grained harmful behaviors drawn from real policy guidelines. Experiments with leading open-source LLMs (e.g., Llama, Qwen, Mistral) show that Soteria consistently improves safety metrics across high-, mid-, and low-resource languages. These findings highlight a promising path toward scalable, linguistically attuned, and ethically aligned LLMs worldwide.
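The abstract does not say how functional heads are scored or how their parameters are adjusted, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a per-head importance score is already available for one language (the `scores` list is made up), and that steering means subtracting a scaled direction `delta` from the rows of an attention output projection that belong to the selected heads. The names `top_k_heads`, `steer_heads`, `alpha`, and `delta` are hypothetical and chosen purely for illustration.

```python
import random


def top_k_heads(scores, k):
    """Return the indices of the k highest-scoring heads, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)
    return sorted(ranked[:k])


def steer_heads(w_out, delta, head_ids, head_dim, alpha=1.0):
    """Subtract alpha * delta on the rows of the selected heads only.

    w_out and delta are lists of rows; head h owns rows
    [h * head_dim, (h + 1) * head_dim). All other rows are copied unchanged.
    """
    steered = [row[:] for row in w_out]
    for h in head_ids:
        for r in range(h * head_dim, (h + 1) * head_dim):
            steered[r] = [w - alpha * d for w, d in zip(w_out[r], delta[r])]
    return steered


# Toy dimensions and random weights (assumptions, not values from the paper).
random.seed(0)
n_heads, head_dim, d_model = 8, 4, 16
n_rows = n_heads * head_dim
w_out = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_rows)]
delta = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_rows)]

# Hypothetical per-head importance scores for a single language.
scores = [0.1, 0.9, 0.2, 0.05, 0.7, 0.3, 0.0, 0.4]

selected = top_k_heads(scores, k=2)
steered = steer_heads(w_out, delta, selected, head_dim, alpha=0.5)

# Share of projection rows that were actually modified.
changed_rows = sum(s != w for s, w in zip(steered, w_out))
fraction_changed = changed_rows / n_rows
print(selected, fraction_changed)
```

With two of eight heads selected, only a quarter of the projection rows change, which illustrates the "fraction of parameters" idea in the abstract. In this sketch, a separate `scores` list per language would give a different selection for each language.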

large language model, machine learning, natural language

Country:
- Asia (0.68)
- North America
  - Mexico (0.28)
  - United States (0.28)

Genre:
- Research Report (0.82)

Industry:
- Information Technology > Security & Privacy (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.71)
  - Natural Language > Large Language Model (1.00)