An indicator for effectiveness of text-to-image guardrails utilizing the Single-Turn Crescendo Attack (STCA)

Kwartler, Ted, Bagan, Nataliia, Banny, Ivan, Aqrawi, Alan, Abbasi, Arian

Nov-27-2024–arXiv.org Artificial Intelligence

The Single-Turn Crescendo Attack (STCA), first introduced in Aqrawi and Abbasi [2024], is an innovative method designed to bypass the ethical safeguards of text-to-text AI models, compelling them to generate harmful content. This technique leverages a strategic escalation of context within a single prompt, combined with trust-building mechanisms, to subtly deceive the model into producing unintended outputs. Extending the application of STCA to text-to-image models, we demonstrate its efficacy by compromising the guardrails of a widely-used model, DALL-E 3, achieving outputs comparable to outputs from the uncensored model Flux Schnell, which served as a baseline control. This study provides a framework for researchers to rigorously evaluate the robustness of guardrails in text-to-image models and benchmark their resilience against adversarial attacks.

guardrail, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

Nov-27-2024

arXiv.org PDF

Add feedback

Country:
- Europe > Netherlands > South Holland > Leiden (0.04)

Genre:
- Research Report
  - Promising Solution (0.34)
  - Experimental Study (0.34)

Industry:
- Health & Medicine (0.95)
- Information Technology > Security & Privacy (0.88)
- Law (0.69)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning > Generative AI (0.53)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found