When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search

Xuan Chen


Warning: This paper contains unfiltered and potentially harmful content.

Recent studies have developed jailbreaking attacks, which construct jailbreaking prompts to "fool" LLMs into responding to harmful questions. Early-stage jailbreaking attacks require access to model internals or substantial human effort. More recent attacks employ genetic algorithms to automate the attack in a black-box setting. However, the inherent randomness of genetic algorithms significantly limits the effectiveness of these attacks.