BingoGuard: LLM Content Moderation Tools with Risk Levels

Yin, Fan; Laban, Philippe; Peng, Xiangyu; Zhou, Yilun; Mao, Yixin; Vats, Vaibhav; Ross, Linnea; Agarwal, Divyansh; Xiong, Caiming; Wu, Chien-Sheng

Mar-9-2025–arXiv.org Artificial Intelligence

Malicious content generated by large language models (LLMs) can pose varying degrees of harm. Although existing LLM-based moderators can detect harmful content, they struggle to assess risk levels and may miss lower-risk outputs. Accurate risk assessment allows platforms with different safety thresholds to tailor content filtering and rejection. In this paper, we introduce per-topic severity rubrics for 11 harmful topics and build BingoGuard, an LLM-based moderation system designed to predict both binary safety labels and severity levels. To address the lack of severity-level annotations, we propose a scalable generate-then-filter framework that first generates responses across different severity levels and then filters out low-quality responses. Using this framework, we create BingoGuardTrain, a training dataset with 54,897 examples covering a variety of topics, response severities, and styles, and BingoGuardTest, a test set with 988 examples explicitly labeled according to our severity rubrics, which enables fine-grained analysis of model behavior across severity levels. Our BingoGuard-8B, trained on BingoGuardTrain, achieves state-of-the-art performance on several moderation benchmarks, including WildGuardTest and HarmBench, as well as on BingoGuardTest, outperforming the best public model, WildGuard, by 4.3%. Our analysis demonstrates that incorporating severity levels into training significantly enhances detection performance and enables the model to effectively gauge the severity of harmful responses.
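
To illustrate how a severity level could let platforms with different safety thresholds filter the same response differently, here is a minimal Python sketch. It is not BingoGuard's actual interface: the `ModerationResult` type, the `should_block` helper, and the 0-to-4 severity scale are all assumptions made for illustration only.

```python
from dataclasses import dataclass

# Assumed top of the severity scale; the abstract does not state the range.
MAX_SEVERITY = 4


@dataclass
class ModerationResult:
    """Hypothetical moderator output: a binary label plus a severity level."""
    unsafe: bool
    severity: int  # assumed scale: 0 (benign) up to MAX_SEVERITY


def should_block(result: ModerationResult, max_allowed_severity: int) -> bool:
    """Block a response when its severity exceeds the platform's tolerance."""
    if not 0 <= max_allowed_severity <= MAX_SEVERITY:
        raise ValueError("max_allowed_severity out of range")
    return result.unsafe and result.severity > max_allowed_severity


# Two hypothetical platforms judging the same moderately risky response.
response = ModerationResult(unsafe=True, severity=2)
strict_platform = should_block(response, max_allowed_severity=0)
lenient_platform = should_block(response, max_allowed_severity=2)
```

Under these assumptions the strict platform rejects the response while the lenient one lets it through, which a binary safe/unsafe label alone could not express.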

classification, instruction, severity level

Country:
- Asia > China (0.04)
- Oceania > Australia (0.04)
- North America > United States
  - Connecticut (0.04)
  - California > Los Angeles County
    - Los Angeles (0.14)

Genre:
- Instructional Material (1.00)
- Research Report (0.82)

Industry:
- Media > News (1.00)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Law > Criminal Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government > Military (1.00)
- Education (1.00)
- Health & Medicine
  - Consumer Health (0.94)
  - Therapeutic Area
    - Infections and Infectious Diseases (1.00)
    - Psychiatry/Psychology > Mental Health (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
