Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters

May-27-2025, 04:33:19 GMT–Neural Information Processing Systems

Large Language Models (LLMs) are typically harmless but remain vulnerable to carefully crafted prompts known as jailbreaks'', which can bypass protective measures and induce harmful behavior. Recent advancements in LLMs have incorporated moderation guardrails that can filter outputs, which trigger processing errors for certain malicious questions. Existing red-teaming benchmarks often neglect to include questions that trigger moderation guardrails, making it difficult to evaluate jailbreak effectiveness. To address this issue, we introduce JAMBench, a harmful behavior benchmark designed to trigger and evaluate moderation guardrails. JAMBench involves 160 manually crafted instructions covering four major risk categories at multiple severity levels.

large language model, moderation guardrail, natural language, (2 more...)

Neural Information Processing Systems

May-27-2025, 04:33:19 GMT

Conferences Web Page

Add feedback

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)