Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Neural Information Processing Systems 

While existing open moderation tools such as Llama-Guard2 [16] score reasonably well on classifying straightforward model interactions, they lag far behind a prompted GPT-4, especially in identifying adversarial jailbreaks and in evaluating models' refusals, a key measure of safety behavior in model responses.