When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search

Xuan Chen


Warning: This paper contains unfiltered and potentially harmful content.

Recent studies have developed jailbreaking attacks, which construct jailbreaking prompts to "fool" LLMs into responding to harmful questions. Early-stage jailbreaking attacks require access to model internals or substantial human effort. More recent attacks employ genetic algorithms to automate the attack in a black-box setting. However, the inherent randomness of genetic algorithms significantly limits the effectiveness of these attacks.