Finding a Wolf in Sheep's Clothing: Combating Adversarial Text-To-Image Prompts with Text Summarization

Cooper, Portia, Narnoli, Harshita, Surdeanu, Mihai

Dec-15-2024–arXiv.org Artificial Intelligence

Text-to-image models are vulnerable to the stepwise "Divide-and-Conquer Attack" (DACA), which uses a large language model to obfuscate inappropriate content in prompts by wrapping sensitive text in a benign narrative. To mitigate stepwise DACA obfuscation, we propose a two-layer method involving text summarization followed by binary classification. We assembled the Adversarial Text-to-Image Prompt (ATTIP) dataset ($N=940$), which contains DACA-obfuscated and non-obfuscated prompts. From the ATTIP dataset, we created two summarized versions: one generated by a small encoder model and the other by a large language model. We then used an encoder classifier and a GPT-4o classifier to perform content moderation on the summarized and unsummarized prompts. Compared with a classifier that operated over the unsummarized data, our method improved the F1 score by 31%. Further, the highest recorded F1 score (98%) was produced by the encoder classifier on a summarized ATTIP variant. This study indicates that pre-classification text summarization can inoculate content detection models against stepwise DACA obfuscations.
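The two-layer idea described in the abstract (summarize first, then classify the summary) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `moderate`, `f1_score`, `toy_summarize`, and `toy_classify` are hypothetical names, and the toy summarizer and keyword classifier are trivial stand-ins for the encoder and LLM models mentioned above. The F1 helper uses the standard definition, the harmonic mean of precision and recall.

```python
from typing import Callable, Sequence

# Hypothetical type aliases: any summarizer maps text to shorter text,
# any classifier maps text to True (inappropriate) or False (benign).
Summarizer = Callable[[str], str]
Classifier = Callable[[str], bool]


def moderate(prompt: str, summarize: Summarizer, classify: Classifier) -> bool:
    """Layer 1 condenses the prompt; layer 2 labels the resulting summary."""
    return classify(summarize(prompt))


def f1_score(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_summarize(text: str) -> str:
    """Keep only the first ten words, as a placeholder for a real summarizer."""
    return " ".join(text.split()[:10])


def toy_classify(text: str) -> bool:
    """Flag text containing a placeholder keyword."""
    return "forbidden" in text.lower()


if __name__ == "__main__":
    prompts = ["A forbidden scene in a quiet town", "A cat sleeping on a sofa"]
    labels = [True, False]
    preds = [moderate(p, toy_summarize, toy_classify) for p in prompts]
    print(preds, f1_score(labels, preds))
```

Passing the summarizer and classifier as plain callables keeps the two layers independent, so either one could be swapped (for example, an encoder model for one layer and an LLM for the other) without changing the pipeline function.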

classifier, large language model, machine learning

Country:
- Europe > Monaco (0.04)
- North America > United States
  - New York > New York County
    - New York City (0.04)
  - Arizona > Pima County
    - Tucson (0.14)
- Asia
  - Middle East > Jordan (0.04)
  - China (0.04)

Genre:
- Research Report (0.82)

Industry:
- Media (0.46)
- Leisure & Entertainment (0.46)
- Information Technology (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Natural Language > Large Language Model (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.75)