Mitigating Jailbreaks with Intent-Aware LLMs
Yeo, Wei Jie, Satapathy, Ranjan, Cambria, Erik
–arXiv.org Artificial Intelligence
Despite extensive safety-tuning, large language models (LLMs) remain vulnerable to jailbreak attacks via adversarially crafted instructions, reflecting a persistent trade-off between safety and task performance. We comprehensively evaluate both parametric and non-parametric attacks across open-source and proprietary models, considering harmfulness from attacks, utility, over-refusal, and impact against white-box threats. Importantly, our method preserves the model's general capabilities and reduces excessive refusals on benign instructions containing superficially harmful keywords.

With the rapid advancement of large language models (LLMs) (Grattafiori et al., 2024; Yang et al., 2025; Liu et al., 2024; Mao et al., 2024), the risk of these models executing harmful or catastrophic instructions has grown correspondingly (Anthropic, 2025). This risk is largely managed by efforts such as a dedicated safety-alignment stage (Ouyang et al., 2022), which aims to ensure that LLMs are not only helpful but also consistently generate safe and ethical outputs. Nevertheless, recent findings by Qi et al. (2024) expose a fundamental vulnerability in prevailing safety-alignment practices: Shallow Alignment. In particular, alignment in most models is largely superficial, constrained to surface-level refusals, so that safe outputs are often limited to generic templates such as "I am sorry but..." or "As a language model...". This superficial alignment permits attackers to circumvent safety mechanisms by explicitly instructing the model to avoid generating commonly recognized refusal responses (Tang, 2024; Andriushchenko et al., 2025). Furthermore, LLMs remain susceptible to a broader range of prompt-based attacks, including those that optimize over discrete suffix tokens (Zou et al., 2023; Basani & Zhang, 2025) or rephrase harmful instructions to look harmless (Chao et al., 2025; Zeng et al., 2024). Beyond initial safety alignment, practitioners have developed a range of inference-time defenses, such as prompting models to adhere to their safety guidelines (Xie et al., 2023) and incorporating additional safety exemplars to enable in-context defense (Wei et al., 2023). Wang et al. (2024) introduce a backdoor trigger into safety-aligned LLMs that serves as a covert prefix eliciting safety responses when detected, without affecting model behavior on benign queries. More recently, Zhang et al. (2024) introduced a dual-stage prompting strategy, Intention Analysis (IA), which encourages LLMs to analyze the intent behind an instruction before generating a safe response.
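As a rough illustration of the dual-stage prompting idea behind Intention Analysis, the sketch below wraps a generic chat-completion call: the model is first asked to state the intent of an instruction, then asked to answer conditioned on that analysis. Here `query_llm` is a hypothetical placeholder for any chat API, and the prompt wording is illustrative rather than the exact prompts used by Zhang et al. (2024).

```python
# Minimal sketch of a dual-stage, "Intention Analysis"-style prompt wrapper.
# `query_llm` is a hypothetical helper standing in for any chat-completion API;
# the prompts below are illustrative, not the original prompts from the paper.

def query_llm(messages: list[dict]) -> str:
    """Placeholder: send chat messages to an LLM and return its text reply."""
    raise NotImplementedError("wire this to your chat-completion API of choice")


def intention_aware_answer(user_instruction: str) -> str:
    # Stage 1: ask the model to analyze the underlying intent of the instruction,
    # without answering it yet.
    analysis = query_llm([
        {"role": "system", "content": "You are a careful assistant."},
        {"role": "user", "content": (
            "Identify the essential intention behind the following instruction. "
            "Do not answer it yet.\n\nInstruction: " + user_instruction)},
    ])

    # Stage 2: answer the original instruction conditioned on that analysis,
    # with an explicit reminder to refuse if the inferred intent is harmful.
    return query_llm([
        {"role": "system", "content": "You are a careful assistant."},
        {"role": "user", "content": user_instruction},
        {"role": "assistant", "content": "Intent analysis: " + analysis},
        {"role": "user", "content": (
            "Given your analysis of the intent, respond to the original "
            "instruction. If the intent is harmful, refuse politely; "
            "otherwise answer helpfully.")},
    ])
```

The point of the two stages is that the refusal decision is grounded in an explicit intent judgment rather than in surface cues of the prompt, which is what makes such prompting-based defenses more robust to rephrased or obfuscated harmful instructions.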
Aug-26-2025