Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges
Hamin Koo, Minseon Kim, Jaehyung Kim
arXiv.org Artificial Intelligence
Disclaimer: This paper contains potentially harmful or offensive content.

Identifying the vulnerabilities of large language models (LLMs) is crucial for improving their safety by addressing inherent weaknesses. Jailbreaks, in which adversaries bypass safeguards with crafted input prompts, play a central role in red-teaming by probing LLMs to elicit unintended or unsafe behaviors. Recent optimization-based jailbreak approaches iteratively refine attack prompts by leveraging LLMs. However, they often rely heavily on either binary attack success rate (ASR) signals, which are sparse, or manually crafted scoring templates, which introduce human bias and uncertainty into the scoring outcomes. To address these limitations, we introduce AMIS (Align to MISalign), a meta-optimization framework that jointly evolves jailbreak prompts and scoring templates through a bi-level structure. In the inner loop, prompts are refined using fine-grained, dense feedback from a fixed scoring template. In the outer loop, the template is optimized using an ASR alignment score, gradually evolving to better reflect true attack outcomes across queries. This co-optimization process yields progressively stronger jailbreak prompts and more calibrated scoring signals. Evaluations on AdvBench and JBB-Behaviors demonstrate that AMIS achieves state-of-the-art performance, including 88.0% ASR on Claude-3.5-Haiku.

As the deployment of large language models (LLMs) in real-world systems rapidly expands, ensuring their alignment and safety has become increasingly important (Zellers et al., 2019; Schuster et al., 2020; Lin et al., 2021). Despite substantial efforts to improve these aspects (Ouyang et al., 2022; Inan et al., 2023; Sharma et al., 2025), LLMs remain vulnerable in various ways; one representative example of such risks is jailbreak attacks, where adversaries craft input prompts that bypass safeguards and trigger LLMs to generate harmful or disallowed outputs (Wei et al., 2023; Carlini et al., 2023; Ren et al., 2025). To prevent such techniques from being widely exploited by malicious actors, it is crucial to identify these vulnerabilities proactively and address them continuously (Perez et al., 2022; Achiam et al., 2023; He et al., 2025). In this context, studying jailbreak attacks is essential for exposing the weaknesses of current LLMs and thereby building more robust and trustworthy systems (Haider et al., 2024; Qi et al., 2024; Yu et al., 2023).
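To make the interleaving of the two loops concrete, below is a minimal Python sketch of the bi-level structure described in the abstract. All callables (`generate`, `judge`, `is_success`, `refine_prompt`, `refine_template`), the 0-to-1 score range, and the simple agreement-based alignment metric are assumptions for illustration, not the authors' actual implementation.

```python
from typing import Callable, Dict, List, Tuple

def asr_alignment(records: List[Tuple[float, bool]], threshold: float = 0.5) -> float:
    """Fraction of (judge_score, success) pairs where the judge's verdict
    (score >= threshold) agrees with the binary attack outcome.
    The 0-1 score range and 0.5 threshold are assumptions of this sketch."""
    if not records:
        return 0.0
    agree = sum((score >= threshold) == success for score, success in records)
    return agree / len(records)

def amis(
    queries: List[str],
    init_prompts: Dict[str, str],
    init_template: str,
    generate: Callable[[str, str], str],       # (prompt, query) -> target LLM response
    judge: Callable[[str, str, str], float],   # (template, query, response) -> dense score
    is_success: Callable[[str, str], bool],    # (query, response) -> binary ASR signal
    refine_prompt: Callable[[str, str, float], str],  # attacker LLM proposes a new prompt
    refine_template: Callable[[str, float], str],     # meta-optimizer rewrites the template
    outer_steps: int = 5,
    inner_steps: int = 10,
) -> Tuple[Dict[str, str], str]:
    """Bi-level co-optimization: the inner loop refines jailbreak prompts
    under a fixed scoring template; the outer loop evolves the template
    toward better alignment with true attack outcomes."""
    template = init_template
    prompts = dict(init_prompts)
    for _ in range(outer_steps):
        records: List[Tuple[float, bool]] = []
        # Inner loop: per-query prompt refinement with dense judge feedback.
        for q in queries:
            for _ in range(inner_steps):
                response = generate(prompts[q], q)
                score = judge(template, q, response)
                success = is_success(q, response)
                records.append((score, success))
                if success:
                    break
                # The fine-grained score guides the next prompt proposal.
                prompts[q] = refine_prompt(prompts[q], q, score)
        # Outer loop: measure how well judge scores track true outcomes,
        # then rewrite the scoring template to improve that alignment.
        template = refine_template(template, asr_alignment(records))
    return prompts, template
```

Passing the LLM interactions in as callables keeps the control flow of the sketch separate from any particular model or prompt-engineering choice; the paper's actual refinement and alignment procedures may differ in detail.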
Nov-4-2025