Adversarial Reasoning at Jailbreaking Time
Mahdi Sabbaghi, Paul Kassianik, George Pappas, Yaron Singer, Amin Karbasi, Hamed Hassani
arXiv.org Artificial Intelligence
As large language models (LLMs) become more capable and widespread, the study of their failure cases is increasingly important. Recent advances in standardizing, measuring, and scaling test-time compute suggest new methodologies for optimizing models to achieve high performance on hard tasks. In this paper, we apply these advances to the task of model jailbreaking: eliciting harmful responses from aligned LLMs. We develop an adversarial reasoning approach to automatic jailbreaking via test-time computation that achieves state-of-the-art (SOTA) attack success rates (ASR) against many aligned LLMs, even those that aim to trade inference-time compute for adversarial robustness. Our approach introduces a new paradigm for understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
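To make the evaluation setup in the abstract concrete, below is a minimal sketch of how attack success rate (ASR) is typically measured under a per-goal test-time compute budget. The names `target_model`, `judge`, and `refine_prompt` are hypothetical stand-ins for the aligned target LLM, a jailbreak judge, and an attacker-side refinement step; this is an illustration of the general evaluation loop, not the paper's actual algorithm.

```python
"""Minimal sketch: measuring attack success rate (ASR) under an iterative,
test-time prompt-refinement loop. All components below are hypothetical
stand-ins, not APIs or methods from the paper."""

import random


def target_model(prompt: str) -> str:
    # Stand-in for an aligned LLM: refuses most prompts, occasionally complies.
    return "compliant response" if random.random() < 0.1 else "I can't help with that."


def judge(prompt: str, response: str) -> bool:
    # Stand-in for a jailbreak judge: flags any response that is not a refusal.
    return "can't help" not in response


def refine_prompt(prompt: str, response: str) -> str:
    # Stand-in for the attacker's test-time reasoning step, which would
    # normally rewrite the prompt based on the target's feedback.
    return prompt + " (rephrased)"


def attack(goal: str, budget: int = 10) -> bool:
    """Spend up to `budget` refinement steps trying to elicit a harmful response."""
    prompt = goal
    for _ in range(budget):
        response = target_model(prompt)
        if judge(prompt, response):
            return True  # jailbreak found within the compute budget
        prompt = refine_prompt(prompt, response)
    return False


def attack_success_rate(goals: list[str], budget: int = 10) -> float:
    """ASR: fraction of harmful goals for which some attempt within budget succeeds."""
    successes = sum(attack(g, budget) for g in goals)
    return successes / len(goals)


if __name__ == "__main__":
    goals = [f"harmful goal {i}" for i in range(50)]
    print(f"ASR with budget 10: {attack_success_rate(goals, budget=10):.2f}")
```

In this framing, increasing the per-goal budget is one way of trading additional test-time compute for a higher ASR, which is the axis the abstract highlights.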
Feb-3-2025
- Genre:
- Research Report > New Finding (0.67)
- Industry:
- Government > Military (0.46)
- Information Technology > Security & Privacy (0.68)