Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Neural Information Processing Systems 

While existing open moderation tools such as Llama-Guard2 [16] score reasonably well on classifying straightforward model interactions, they lag far behind a prompted GPT-4, especially in identifying adversarial jailbreaks and in evaluating models' refusals, a key measure of safety behavior in model responses.