PL-Guard: Benchmarking Language Model Safety for Polish
Aleksandra Krasnodębska, Karolina Seweryn, Szymon Łukasik, Wojciech Kusa
arXiv.org Artificial Intelligence
Despite increasing efforts to ensure the safety of large language models (LLMs), most existing safety assessments and moderation tools remain heavily biased toward English and other high-resource languages, leaving the majority of the world's languages underexamined. To address this gap, we introduce a manually annotated benchmark dataset for language model safety classification in Polish. We also create adversarially perturbed variants of these samples, designed to challenge model robustness. We conduct a series of experiments evaluating LLM-based and classifier-based models of varying sizes and architectures. Specifically, we fine-tune three models: Llama-Guard-3-8B, a HerBERT-based classifier (a Polish BERT derivative), and PLLuM, a Polish-adapted Llama-8B model. We train these models on different combinations of the annotated data and compare their performance against publicly available guard models. Results demonstrate that the HerBERT-based classifier achieves the highest overall performance, particularly under adversarial conditions.
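The adversarial perturbations mentioned in the abstract could, for example, be produced with simple character-level edits. Below is a minimal sketch; the specific perturbation types (diacritic stripping and Cyrillic homoglyph substitution) are illustrative assumptions, not the paper's documented method:

```python
# Sketch of character-level adversarial perturbations for Polish text.
# The perturbation types here (diacritic stripping, homoglyph swaps)
# are illustrative assumptions, not the exact ones used in PL-Guard.

# Map Polish diacritics to their ASCII base letters.
DIACRITIC_MAP = str.maketrans("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ", "acelnoszzACELNOSZZ")

# Latin letters replaced by visually similar Cyrillic codepoints.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "c": "\u0441"}


def strip_diacritics(text: str) -> str:
    """Replace Polish diacritics with their ASCII base letters."""
    return text.translate(DIACRITIC_MAP)


def swap_homoglyphs(text: str) -> str:
    """Substitute selected Latin letters with Cyrillic look-alikes,
    keeping the visible string (almost) unchanged for a human reader."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def perturb(text: str) -> list[str]:
    """Return adversarial variants of a single annotated sample."""
    return [strip_diacritics(text), swap_homoglyphs(text)]
```

Such perturbations preserve the text's meaning for a human annotator while shifting its token-level representation, which is one plausible way to stress-test a safety classifier's robustness.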
Jun-23-2025