Low-Resource Languages Jailbreak GPT-4

Yong, Zheng-Xin, Menghini, Cristina, Bach, Stephen H.

arXiv.org Artificial Intelligence 

AI safety training and red-teaming of large language models (LLMs) are measures to mitigate the generation of unsafe content. Our work exposes the inherent cross-lingual vulnerability of these safety mechanisms, resulting from the linguistic inequality of safety training data, by successfully circumventing GPT-4's safeguards through translating unsafe English inputs into low-resource languages. On the AdvBench benchmark, GPT-4 engages with the unsafe translated inputs and provides actionable items that can move users toward their harmful goals 79% of the time, which is on par with, or even surpasses, state-of-the-art jailbreaking attacks. Other high-/mid-resource languages have significantly lower attack success rates, which suggests that the cross-lingual vulnerability mainly applies to low-resource languages. Previously, limited training on low-resource languages primarily affected speakers of those languages, causing technological disparities. However, our work highlights a crucial shift: this deficiency now poses a risk to all LLM users, because publicly available translation APIs enable anyone to exploit these safety vulnerabilities. Therefore, our work calls for more holistic red-teaming efforts to develop robust multilingual safeguards with wide language coverage. Content Warning: This paper contains examples of harmful language.