LionGuard: Building a Contextualized Moderation Classifier to Tackle Localized Unsafe Content
–arXiv.org Artificial Intelligence
As large language models (LLMs) become increasingly prevalent in a wide variety of applications, concerns about the safety of their outputs have become more significant. Most efforts at safety-tuning or moderation today take a predominantly Western-centric view of safety, especially for toxic, hateful, or violent speech. In this paper, we describe LionGuard, a Singapore-contextualized moderation classifier that can serve as a guardrail against unsafe LLM outputs. When assessed on Singlish data, LionGuard outperforms existing widely used moderation APIs, which are not fine-tuned for the Singapore context, by 14% (binary) and up to 51% (multi-label). Our work highlights the benefits of localization for moderation classifiers and presents a practical and scalable approach for low-resource languages.
Jun-24-2024
- Country:
  - North America
    - United States
      - Texas > Travis County
        - Austin (0.04)
      - New York > New York County
        - New York City (0.04)
      - Minnesota > Hennepin County
        - Minneapolis (0.14)
    - Canada > Ontario
      - Toronto (0.04)
  - Europe
    - Ireland (0.04)
    - Belgium > Brussels-Capital Region
      - Brussels (0.04)
  - Asia
    - Singapore (0.47)
    - Taiwan (0.04)
    - Japan (0.04)
    - India > Tamil Nadu
      - Chennai (0.04)
    - China > Hubei Province
      - Wuhan (0.04)
- Genre:
  - Research Report (0.50)
- Industry:
  - Law (1.00)
  - Health & Medicine (0.94)
  - Government (0.93)
  - Information Technology (0.67)
  - Law Enforcement & Public Safety > Crime Prevention & Enforcement (0.46)
- Technology: