PL-Guard: Benchmarking Language Model Safety for Polish
Aleksandra Krasnodębska, Karolina Seweryn, Szymon Łukasik, Wojciech Kusa
arXiv.org Artificial Intelligence
Despite increasing efforts to ensure the safety of large language models (LLMs), most existing safety assessments and moderation tools remain heavily biased toward English and other high-resource languages, leaving the majority of the world's languages underexamined. To address this gap, we introduce a manually annotated benchmark dataset for language model safety classification in Polish. We also create adversarially perturbed variants of these samples, designed to challenge model robustness. We conduct a series of experiments evaluating LLM-based and classifier-based models of varying sizes and architectures. Specifically, we fine-tune three models: Llama-Guard-3-8B, a HerBERT-based classifier (a Polish BERT derivative), and PLLuM, a Polish-adapted Llama-8B model. We train these models on different combinations of the annotated data and compare their performance against publicly available guard models. Results demonstrate that the HerBERT-based classifier achieves the highest overall performance, particularly under adversarial conditions.
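The adversarial perturbations mentioned in the abstract could, for example, be produced with simple character-level edits. Below is a minimal sketch; the specific perturbation types (diacritic stripping and Cyrillic homoglyph substitution) are illustrative assumptions, not the paper's documented method:

```python
# Sketch of character-level adversarial perturbations for Polish text.
# The perturbation types here (diacritic stripping, homoglyph swaps)
# are illustrative assumptions, not the exact ones used in PL-Guard.

# Map Polish diacritics to their ASCII base letters.
DIACRITIC_MAP = str.maketrans("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ", "acelnoszzACELNOSZZ")

# Latin letters replaced by visually similar Cyrillic codepoints.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "c": "\u0441"}


def strip_diacritics(text: str) -> str:
    """Replace Polish diacritics with their ASCII base letters."""
    return text.translate(DIACRITIC_MAP)


def swap_homoglyphs(text: str) -> str:
    """Substitute selected Latin letters with Cyrillic look-alikes,
    keeping the visible string (almost) unchanged for a human reader."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def perturb(text: str) -> list[str]:
    """Return adversarial variants of a single annotated sample."""
    return [strip_diacritics(text), swap_homoglyphs(text)]
```

Such perturbations preserve the text's meaning for a human annotator while shifting its token-level representation, which is one plausible way to stress-test a safety classifier's robustness.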
Jun-23-2025