Prefill-level Jailbreak: A Black-Box Risk Analysis of Large Language Models
Li, Yakai, Hu, Jiekang, Sang, Weiduan, Ma, Luping, Nie, Dongsheng, Zhang, Weijuan, Yu, Aimin, Su, Yi, Huang, Qingjia, Zhou, Qihang
–arXiv.org Artificial Intelligence
Large Language Models face security threats from jailbreak attacks. Existing research has predominantly focused on prompt-level attacks while largely ignoring the underexplored attack surface of user-controlled response prefilling. This functionality allows an attacker to dictate the beginning of a model's output, thereby shifting the attack paradigm from persuasion to direct state manipulation.In this paper, we present a systematic black-box security analysis of prefill-level jailbreak attacks. We categorize these new attacks and evaluate their effectiveness across fourteen language models. Our experiments show that prefill-level attacks achieve high success rates, with adaptive methods exceeding 99% on several models. Token-level probability analysis reveals that these attacks work through initial-state manipulation by changing the first-token probability from refusal to compliance.Furthermore, we show that prefill-level jailbreak can act as effective enhancers, increasing the success of existing prompt-level attacks by 10 to 15 percentage points. Our evaluation of several defense strategies indicates that conventional content filters offer limited protection. We find that a detection method focusing on the manipulative relationship between the prompt and the prefill is more effective. Our findings reveal a gap in current LLM safety alignment and highlight the need to address the prefill attack surface in future safety training.
arXiv.org Artificial Intelligence
Aug-27-2025
- Country:
- Genre:
- Research Report > New Finding (0.66)
- Industry:
- Information Technology > Security & Privacy (1.00)
- Technology: