Safe at the Margins: A General Approach to Safety Alignment in Low-Resource English Languages -- A Singlish Case Study
Isaac Lim, Shaun Khoo, Watson Chua, Goh Jiayi, Jessica Foo
arXiv.org Artificial Intelligence
To ensure safe usage, Large Language Models (LLMs) typically undergo alignment with human-defined values. However, this alignment often relies primarily on English data and is biased towards Western-centric values, limiting its effectiveness in low-resource language settings. In this paper, we describe our approach for aligning SEA-Lion-v2.1-Instruct (a Llama3-8B variant) to minimize toxicity in Singlish, an English creole specific to Singapore. We find that supervised fine-tuning and Kahneman-Tversky Optimization (KTO) on paired and unpaired preferences is more sample-efficient and yields significantly better results than Direct Preference Optimization (DPO). Our analysis reveals that DPO implicitly enforces a weaker safety objective than KTO, and that SFT complements KTO by improving training stability. Finally, we introduce a simple but novel modification to KTO, KTO-S, which improves training stability through better gradient exploitation. Overall, we present a general approach for safety alignment conducive to low-resource English languages, successfully reducing toxicity by 99% on our Singlish benchmark, with gains generalizing to the broader TOXIGEN dataset while maintaining strong performance across standard LLM benchmarks.
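For readers unfamiliar with KTO, the sketch below illustrates the standard KTO loss (Ethayarajh et al., 2024) that the paper builds on. This is a simplified NumPy illustration, not the paper's KTO-S variant (whose gradient modification is not specified in the abstract); the function name, hyperparameter defaults, and the batch-level KL estimate are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def kto_loss(policy_logps, ref_logps, desirable,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Simplified per-example KTO loss (standard KTO, not KTO-S).

    policy_logps / ref_logps: log-probabilities of each completion under
    the policy and the frozen reference model.
    desirable: boolean array, True for desirable (safe) completions,
    False for undesirable (toxic) ones -- KTO needs only these unpaired
    binary labels, unlike DPO's paired preferences.
    """
    # Implied reward: beta-scaled log-ratio of policy to reference.
    rewards = beta * (policy_logps - ref_logps)
    # Reference point: a crude batch-level KL estimate, clamped at zero.
    z0 = max(rewards.mean(), 0.0)
    # Desirable examples are pushed above the reference point,
    # undesirable ones below it; lambda weights the two classes.
    lam = np.where(desirable, lambda_d, lambda_u)
    value = np.where(desirable,
                     sigmoid(rewards - z0),
                     sigmoid(z0 - rewards))
    return lam * (1.0 - value)
```

Because undesirable examples only need a binary "toxic" label, this objective can consume unpaired safety data directly, which is part of why the authors find it more sample-efficient than DPO in the low-resource Singlish setting.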
Feb-17-2025