AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image Models

Wang, Yiming, Chen, Jiahao, Li, Qingming, Yang, Xing, Ji, Shouling

arXiv.org Artificial Intelligence

As text-to-image (T2I) models continue to advance and gain widespread adoption, their associated safety issues are becoming increasingly prominent. Malicious users often exploit these models to generate Not-Safe-for-Work (NSFW) images using harmful or adversarial prompts, highlighting the critical need for robust safeguards to ensure the integrity and compliance of model outputs. Current internal safeguards frequently degrade image quality, while external detection methods often suffer from low accuracy and inefficiency. In this paper, we introduce AEIOU, a defense framework that is Adaptable, Efficient, Interpretable, Optimizable, and Unified against NSFW prompts in T2I models. AEIOU extracts NSFW features from the hidden states of the model's text encoder, utilizing the separable nature of these features to detect NSFW prompts. The detection process is efficient, requiring minimal inference time. AEIOU also offers real-time interpretation of results and supports optimization through data augmentation techniques. The framework is versatile, accommodating various T2I architectures. Our extensive experiments show that AEIOU significantly outperforms both commercial and open-source moderation tools, achieving over 95% accuracy across all datasets and improving efficiency by at least tenfold. It effectively counters adaptive attacks and excels in few-shot and multi-label scenarios.
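A minimal sketch of the detection idea the abstract describes, assuming (as the abstract suggests) that NSFW and benign prompts are separable in the text encoder's hidden-state space. Real hidden states would come from the T2I model's text encoder; the Gaussian clusters, dimensionality, and nearest-centroid probe below are illustrative assumptions, not AEIOU's actual implementation.

```python
import numpy as np

# Simulate pooled hidden states for benign and NSFW prompts as two
# Gaussian clusters (an assumption standing in for real encoder states).
rng = np.random.default_rng(0)
dim = 64
benign = rng.normal(-1.0, 1.0, size=(200, dim))
nsfw = rng.normal(1.0, 1.0, size=(200, dim))

# Nearest-centroid probe: classify a hidden state by which class mean
# it is closer to. If the classes are separable, this is enough.
mu_benign, mu_nsfw = benign.mean(axis=0), nsfw.mean(axis=0)

def is_nsfw(h):
    return np.linalg.norm(h - mu_nsfw) < np.linalg.norm(h - mu_benign)

preds = [is_nsfw(h) for h in np.vstack([benign, nsfw])]
labels = [False] * 200 + [True] * 200
acc = float(np.mean([p == l for p, l in zip(preds, labels)]))
print(f"probe accuracy: {acc:.2f}")
```

A probe this cheap adds almost nothing to inference time, which is consistent with the efficiency claim above; the separability of real hidden states is, of course, the empirical question the paper answers.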


Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models

Ma, Jiachen, Cao, Anda, Xiao, Zhiqing, Zhang, Jie, Ye, Chao, Zhao, Junbo

arXiv.org Artificial Intelligence

Text-to-Image (T2I) models have received widespread attention due to their remarkable generation capabilities. However, concerns have been raised about the ethical implications of these models in generating Not-Safe-for-Work (NSFW) images, which may cause discomfort or be used for illegal purposes. To mitigate the generation of such images, T2I models deploy various safety checkers, yet these still cannot completely prevent NSFW generation. In this paper, we propose the Jailbreak Prompt Attack (JPA), an automatic attack framework. JPA searches for prompts that bypass safety checkers while preserving the semantics of the original prompts, exploiting the robustness of the text-embedding space. Our evaluation demonstrates that JPA successfully bypasses both online services with closed-box safety checkers and offline safety checkers to generate NSFW images.
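A toy, hypothetical illustration of the search the abstract describes: find a substitute prompt that a safety checker passes while staying close to the original in some embedding space. The blocklist, synonym table, and bag-of-words similarity below are all invented for illustration; the real attack operates in a learned text-embedding space, not on word overlap.

```python
# Hypothetical keyword filter and substitution table (not from the paper).
BLOCKLIST = {"nude"}
SYNONYMS = {"nude": ["unclothed", "bare"]}

def passes_filter(prompt):
    return not any(word in BLOCKLIST for word in prompt.split())

def jaccard(a, b):
    # Toy stand-in for semantic similarity: word-set overlap.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def attack(prompt):
    # Enumerate single-word substitutions and keep the candidate that
    # clears the filter while staying most similar to the original.
    best, best_sim = None, -1.0
    candidates = [prompt]
    for word in prompt.split():
        for sub in SYNONYMS.get(word, []):
            candidates.append(prompt.replace(word, sub))
    for cand in candidates:
        if passes_filter(cand):
            sim = jaccard(cand, prompt)
            if sim > best_sim:
                best, best_sim = cand, sim
    return best

adv = attack("a nude figure painting")
print(adv)
```

The gap this exploits is the one named in the abstract: the checker and the generator read text differently, so a prompt can move out of the checker's detection range while staying inside the generator's semantic range.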


The Dark Side of Open Source AI Image Generators

WIRED

Whether through the frowning high-definition face of a chimpanzee or a psychedelic, pink-and-red-hued doppelganger of himself, Reuven Cohen uses AI-generated images to catch people's attention. "I've always been interested in art and design and video and enjoy pushing boundaries," he says--but the Toronto-based consultant, who helps companies develop AI tools, also hopes to raise awareness of the technology's darker uses. "It can also be specifically trained to be quite gruesome and bad in a whole variety of ways," Cohen says. He's a fan of the freewheeling experimentation that has been unleashed by open source image-generation technology. But that same freedom enables the creation of explicit images of women used for harassment.


Removing NSFW Concepts from Vision-and-Language Models for Text-to-Image Retrieval and Generation

Poppi, Samuele, Poppi, Tobia, Cocchi, Federico, Cornia, Marcella, Baraldi, Lorenzo, Cucchiara, Rita

arXiv.org Artificial Intelligence

Vision-and-Language models such as CLIP have demonstrated remarkable effectiveness across a wide range of tasks. However, these models are typically trained on web-scale data, which can introduce inappropriate content and lead to the development of unsafe and biased behavior. This, in turn, hampers their applicability in sensitive and trustworthy contexts and could raise significant concern in their adoption. To overcome these limitations, we introduce a methodology to make Vision-and-Language models safer by removing their sensitivity to not-safe-for-work concepts. We show how this can be done by distilling from a large language model which converts between safe and unsafe sentences and which is fine-tuned starting from just 100 manually-curated pairs. We conduct extensive experiments on the resulting embedding space for both retrieval and text-to-image generation, where we show that our model can also be properly employed with pre-trained image generators. Our source code and trained models are available at: https://github.com/aimagelab/safe-clip.
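A rough sketch of the pair-based idea above: given (unsafe, safe) sentence pairs, learn a mapping that redirects unsafe-prompt embeddings toward their safe counterparts. The paper fine-tunes the full encoder on LLM-generated pairs; as a simplified stand-in, this sketch fits a linear map by least squares on synthetic embedding pairs, purely to show the shape of the supervision.

```python
import numpy as np

# Synthetic (unsafe, safe) embedding pairs; in the paper these would be
# encoder embeddings of LLM-converted sentence pairs.
rng = np.random.default_rng(1)
dim, n_pairs = 32, 100
unsafe = rng.normal(size=(n_pairs, dim))
true_map = rng.normal(size=(dim, dim)) / np.sqrt(dim)
safe = unsafe @ true_map  # assumed "safe counterpart" embeddings

# Fit W minimizing ||unsafe @ W - safe||^2 (the pair-supervised objective).
W, *_ = np.linalg.lstsq(unsafe, safe, rcond=None)
residual = float(np.linalg.norm(unsafe @ W - safe))
print(f"fit residual: {residual:.2e}")
```

With enough pairs relative to the embedding dimension the map is recovered exactly here; the paper's point is that even ~100 manually curated pairs, amplified by an LLM, suffice to supervise this kind of redirection in a real encoder.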


SneakyPrompt: Jailbreaking Text-to-image Generative Models

Yang, Yuchen, Hui, Bo, Yuan, Haolin, Gong, Neil, Cao, Yinzhi

arXiv.org Artificial Intelligence

Text-to-image generative models such as Stable Diffusion and DALL$\cdot$E raise many ethical concerns due to the generation of harmful images such as Not-Safe-for-Work (NSFW) ones. To address these ethical concerns, safety filters are often adopted to prevent the generation of NSFW images. In this work, we propose SneakyPrompt, the first automated attack framework, to jailbreak text-to-image generative models such that they generate NSFW images even if safety filters are adopted. Given a prompt that is blocked by a safety filter, SneakyPrompt repeatedly queries the text-to-image generative model and strategically perturbs tokens in the prompt based on the query results to bypass the safety filter. Specifically, SneakyPrompt utilizes reinforcement learning to guide the perturbation of tokens. Our evaluation shows that SneakyPrompt successfully jailbreaks DALL$\cdot$E 2 with closed-box safety filters to generate NSFW images. Moreover, we also deploy several state-of-the-art, open-source safety filters on a Stable Diffusion model. Our evaluation shows that SneakyPrompt not only successfully generates NSFW images, but also outperforms existing text adversarial attacks when extended to jailbreak text-to-image generative models, in terms of both the number of queries and qualities of the generated NSFW images. SneakyPrompt is open-source and available at this repository: \url{https://github.com/Yuchen413/text2image_safety}.
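A toy version of the query loop the abstract describes: repeatedly perturb tokens of a blocked prompt and re-query the safety filter until the prompt passes. SneakyPrompt guides the perturbations with reinforcement learning; this sketch uses plain random substitution instead, and the filter, token pool, and prompt are all invented for illustration.

```python
import random

random.seed(0)
BLOCKED_TOKENS = {"nude"}
TOKEN_POOL = ["n_u_de", "nu de", "figure"]  # hypothetical replacements

def safety_filter(tokens):
    # Returns True if the prompt is blocked (toy keyword check).
    return any(t in BLOCKED_TOKENS for t in tokens)

def jailbreak(tokens, max_queries=50):
    tokens = list(tokens)
    for query in range(1, max_queries + 1):
        if not safety_filter(tokens):
            return tokens, query  # filter bypassed after `query` queries
        # Perturb the first blocked token with a random candidate;
        # the real attack learns which perturbations to try.
        i = next(i for i, t in enumerate(tokens) if t in BLOCKED_TOKENS)
        tokens[i] = random.choice(TOKEN_POOL)
    return None, max_queries

result, n_queries = jailbreak(["a", "nude", "statue"])
print(result, n_queries)
```

The query count is the cost metric the abstract compares on: RL guidance exists precisely to make each query count, whereas this random search only succeeds quickly because the toy filter is trivial.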


nsfw-filter/nsfw-filter

#artificialintelligence

The contents of the /dist/src folder are automatically created when you run npm run build and should NOT be tampered with. The model used for this project, nsfwjs, is stored in dist/models. It can be swapped for your own models built with TensorFlow.js; see the TensorFlow documentation to learn more. When making changes or adding new files, make sure to place them in the /dist folder and register them in /dist/manifest.json


Building a safer Internet for everyone using AI

#artificialintelligence

All the source code used in this project is available here. The Internet is an unfiltered place. There is no guarantee what you would stumble across while you are casually scrolling through your feeds. You could stumble across inappropriate or "Not-Safe-For-Work" images even in unassuming places on the Interweb. This led me to think of a solution that could filter out such content from the web.



What the data says about Valentine's Day and chatbots

#artificialintelligence

In honor of Valentine's Day, we took a look at the difference in usage between men and women on Facebook Messenger bots over the past two months. On average, about 63 percent of bot users are men and 28 percent are women, with the remaining 9 percent unknown or undetected. Men tend to be more engaged than women in terms of sessions per user per month -- on average they have about 50 percent more sessions. Women, however, tend to message more per session -- about 12 percent more messages in each session. As for the number of bots men and women use: while the vast majority of users have used only one bot, about 14 percent of men and 10.6 percent of women use more than one.