Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Neural Information Processing Systems
While existing open moderation tools such as Llama-Guard2 [16] score reasonably well in classifying straightforward model interactions, they lag far behind a prompted GPT-4, especially in identifying adversarial jailbreaks and in evaluating models' refusals, a key measure of safety behavior in model responses.
May-28-2025, 11:26:46 GMT