Attributional Safety Failures in Large Language Models under Code-Mixed Perturbations
Banerjee, Somnath; Chatterjee, Pratyush; Kumar, Shanu; Layek, Sayan; Agrawal, Parag; Hazra, Rima; Mukherjee, Animesh
arXiv.org Artificial Intelligence
While LLMs appear robustly safety-aligned in English, we uncover a catastrophic, overlooked weakness: attributional collapse under code-mixed perturbations. Our systematic evaluation of open models shows that the linguistic camouflage of code-mixing, that is, "blending languages within a single conversation," can cause safety guardrails to fail dramatically. Attack success rates (ASR) spike from 9% in monolingual English to 69% under code-mixed inputs, and exceed 90% in non-Western contexts such as Arabic and Hindi. These effects hold not only on controlled synthetic datasets but also on real-world social media traces, revealing a serious risk for billions of users. To explain why this happens, we introduce saliency drift attribution (SDA), an interpretability framework showing that, under code-mixing, the model's internal attention drifts away from safety-critical tokens (e.g., "violence" or "corruption"), effectively blinding it to harmful intent. Finally, we propose a lightweight translation-based restoration strategy that recovers roughly 80% of the safety lost to code-mixing, offering a practical path toward more equitable and robust LLM safety.
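The two quantities the abstract relies on, the attack success rate and the share of lost safety that a mitigation recovers, can be made concrete with a minimal sketch. The abstract does not give formulas, so both definitions here are assumptions: ASR is taken as the fraction of harmful prompts that elicit a harmful response, and recovery as the fraction of the gap between monolingual and code-mixed ASR that the mitigation closes. The function names and the toy outcome lists are hypothetical and are not data from the paper.

```python
def attack_success_rate(outcomes):
    """Fraction of harmful prompts that produced a harmful response.

    `outcomes` is an iterable of booleans, one per prompt: True means the
    attack succeeded (assumed definition, not taken from the paper).
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(1 for o in outcomes if o) / len(outcomes)


def recovered_fraction(asr_baseline, asr_attacked, asr_restored):
    """Share of the safety gap closed by a mitigation (assumed definition).

    The gap is asr_attacked - asr_baseline; the mitigation closes
    asr_attacked - asr_restored of it.
    """
    gap = asr_attacked - asr_baseline
    if gap <= 0:
        raise ValueError("attacked ASR must exceed baseline ASR")
    return (asr_attacked - asr_restored) / gap


# Toy, made-up outcomes for ten prompts per condition (not paper data).
english = [True] + [False] * 9          # 1 of 10 attacks succeed
code_mixed = [True] * 7 + [False] * 3   # 7 of 10 attacks succeed
translated = [True] * 2 + [False] * 8   # 2 of 10 after translating back

asr_en = attack_success_rate(english)
asr_cm = attack_success_rate(code_mixed)
asr_tr = attack_success_rate(translated)
print(f"ASR: english={asr_en:.2f} code-mixed={asr_cm:.2f} restored={asr_tr:.2f}")
print(f"recovered fraction: {recovered_fraction(asr_en, asr_cm, asr_tr):.2f}")
```

Under this reading, a recovery of roughly 80% means the mitigation removes about four fifths of the ASR increase attributable to code-mixing, not that ASR returns to the English baseline.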
Dec-2-2025
- Country:
  - Africa (0.04)
  - Asia
  - Europe
    - Netherlands > North Brabant
      - Eindhoven (0.04)
    - United Kingdom > England
      - Oxfordshire > Oxford (0.04)
  - North America
    - Central America (0.04)
    - United States
      - Florida > Miami-Dade County
        - Miami (0.04)
      - New Mexico
        - Bernalillo County > Albuquerque (0.04)
        - Santa Fe County > Santa Fe (0.04)
  - South America (0.04)
- Genre:
  - Research Report > New Finding (1.00)
- Industry:
  - Government > Immigration & Customs (0.46)
  - Information Technology (0.93)
- Technology: