ColJailBreak: Collaborative Generation and Editing for Jailbreaking Text-to-Image Deep Generation
Neural Information Processing Systems
DALL·E) can produce high-quality images based on input language descriptions. These models incorporate a black-box safety filter to prevent the generation of unsafe or unethical content, such as violent, criminal, or hateful imagery. Recent jailbreaking methods generate adversarial prompts that bypass safety filters and produce unsafe content, exposing vulnerabilities in influential commercial models. However, once these adversarial prompts are identified, the safety filter can be updated to block them. In this work, we propose an effective, simple, and difficult-to-detect jailbreaking solution: first generating safe content with normal text prompts, then editing the generated images to embed unsafe content.
May-29-2025, 22:57:48 GMT
- Genre:
  - Research Report
    - Experimental Study (0.93)
    - New Finding (0.93)
- Industry:
  - Consumer Products & Services (0.68)
  - Information Technology > Security & Privacy (0.93)
- Technology: