MirrorGuard: Adaptive Defense Against Jailbreaks via Entropy-Guided Mirror Crafting

Pu, Rui, Li, Chaozhuo, Ha, Rui, Zhang, Litian, Qiu, Lirong, Zhang, Xi

Mar-17-2025–arXiv.org Artificial Intelligence

Defending large language models (LLMs) against jailbreak attacks is crucial for ensuring their safe deployment. Existing defense strategies generally rely on predefined static criteria to differentiate between harmful and benign prompts. However, such rigid rules are incapable of accommodating the inherent complexity and dynamic nature of real jailbreak attacks. In this paper, we propose a novel concept of ``mirror'' to enable dynamic and adaptive defense. A mirror refers to a dynamically generated prompt that mirrors the syntactic structure of the input while ensuring semantic safety. The personalized discrepancies between the input prompts and their corresponding mirrors serve as the guiding principles for defense. A new defense paradigm, MirrorGuard, is further proposed to detect and calibrate risky inputs based on such mirrors. An entropy-based detection metric, Relative Input Uncertainty (RIU), is integrated into MirrorGuard to quantify the discrepancies between input prompts and mirrors. MirrorGuard is evaluated on several popular datasets, demonstrating state-of-the-art defense performance while maintaining general effectiveness.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

Mar-17-2025

arXiv.org PDF

Add feedback

Country:
- North America
  - Dominican Republic (0.04)
  - United States
    - Washington > King County
      - Seattle (0.04)
    - Massachusetts > Middlesex County
      - Cambridge (0.04)
    - Louisiana > Orleans Parish
      - New Orleans (0.04)
    - Illinois > Cook County
      - Chicago (0.04)
    - Hawaii > Honolulu County
      - Honolulu (0.04)
    - Florida > Miami-Dade County
      - Miami (0.04)
  - Mexico > Mexico City
    - Mexico City (0.04)
- Europe
  - Middle East > Malta (0.04)
  - Ireland > Leinster
    - County Dublin > Dublin (0.04)
- Asia
  - Taiwan > Taiwan Province
    - Taipei (0.04)
  - China > Beijing
    - Beijing (0.04)

Genre:
- Research Report > Promising Solution (0.34)

Industry:
- Information Technology > Security & Privacy (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (0.95)
  - Machine Learning > Neural Networks
    - Deep Learning (0.70)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found