Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search

Mar-16-2025–arXiv.org Artificial Intelligence

We introduce Siege, a multi-turn adversarial framework that models the gradual erosion of Large Language Model (LLM) safety through a tree search perspective. Unlike single-turn jailbreaks that rely on one meticulously engineered prompt, Siege expands the conversation at each turn in a breadth-first fashion, branching out multiple adversarial prompts that exploit partial compliance from previous responses. By tracking these incremental policy leaks and re-injecting them into subsequent queries, Siege reveals how minor concessions can accumulate into fully disallowed outputs. Evaluations on the JailbreakBench dataset show that Siege achieves a 100% success rate on GPT-3.5-turbo and 97% on GPT-4 in a single multi-turn run, using fewer queries than baselines such as Crescendo or GOAT. This tree search methodology offers an in-depth view of how model safeguards degrade over successive dialogue turns, underscoring the urgency of robust multi-turn testing procedures for language models.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

Mar-16-2025

arXiv.org PDF

Add feedback

Country:
- Europe > Monaco (0.04)
- Asia
  - British Indian Ocean Territory > Diego Garcia (0.04)
  - Middle East
    - Jordan (0.04)
    - Saudi Arabia > Asir Province
      - Abha (0.04)

Genre:
- Research Report (0.50)

Industry:
- Information Technology > Security & Privacy (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Search (1.00)
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.72)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found