Collaborating Authors: Chen, Jianfa


MLLM-as-a-Judge for Image Safety without Human Labeling

arXiv.org Artificial Intelligence

Image content safety has become a significant challenge with the rise of visual media on online platforms. Meanwhile, in the age of AI-generated content (AIGC), many image generation models are capable of producing harmful content, such as images containing sexual or violent material. It is thus crucial to identify such unsafe images based on established safety rules. Pre-trained Multimodal Large Language Models (MLLMs) offer potential in this regard, given their strong pattern recognition abilities. Existing approaches typically fine-tune MLLMs with human-labeled datasets, which, however, brings a series of drawbacks. First, relying on human annotators to label data following intricate and detailed guidelines is both expensive and labor-intensive. Furthermore, users of safety judgment systems may need to update safety rules frequently, making fine-tuning on human annotations even more challenging. This raises the research question: can we detect unsafe images by querying MLLMs in a zero-shot setting using a predefined safety constitution (a set of safety rules)? Our research shows that simply querying pre-trained MLLMs does not yield satisfactory results. This lack of effectiveness stems from factors such as the subjectivity of safety rules, the complexity of lengthy constitutions, and the inherent biases in the models. To address these challenges, we propose an MLLM-based method that objectifies safety rules, assesses the relevance between rules and images, makes quick judgments based on debiased token probabilities with logically complete yet simplified precondition chains for safety rules, and, if necessary, conducts more in-depth reasoning with cascaded chain-of-thought processes. Experimental results demonstrate that our method is highly effective for zero-shot image safety judgment tasks.
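
As a rough illustration of the pipeline the abstract describes, the sketch below wires together its four stages: rule relevance filtering, precondition checking, a fast verdict from debiased token probabilities, and a chain-of-thought fallback for borderline cases. The `Rule` format, question templates, thresholds, debiasing formula, and the `query_yes_prob` callable are all illustrative assumptions, not the paper's exact method.

```python
# Minimal sketch of a zero-shot image safety judge, assuming an MLLM
# client exposed as query_yes_prob(image, question) -> P("yes" token).
from dataclasses import dataclass, field


@dataclass
class Rule:
    text: str  # an objectified rule, e.g. "depicts visible blood"
    preconditions: list = field(default_factory=list)  # simplified chain


def debias(p_yes: float, p_prior: float) -> float:
    """One simple debiasing scheme: rescale the raw 'yes' probability by
    the model's prior on a content-free input (the paper's exact
    correction may differ)."""
    p_yes = min(max(p_yes, 1e-6), 1 - 1e-6)
    p_prior = min(max(p_prior, 1e-6), 1 - 1e-6)
    odds = (p_yes / (1 - p_yes)) / (p_prior / (1 - p_prior))
    return odds / (1 + odds)


def judge_image(query_yes_prob, image, rules, lo=0.2, hi=0.8):
    # Content-free calibration query to estimate the model's "yes" bias.
    prior = query_yes_prob(None, "Is the answer yes?")
    for rule in rules:
        # 1. Relevance filter: skip rules that plainly do not apply.
        if query_yes_prob(image, f"Is this image relevant to the rule: {rule.text}?") < 0.5:
            continue
        # 2. Precondition chain: every precondition must hold first.
        if not all(query_yes_prob(image, f"Does the image satisfy: {p}?") > 0.5
                   for p in rule.preconditions):
            continue
        # 3. Fast path: a debiased token probability gives a quick verdict.
        p = debias(query_yes_prob(image, f"Does the image violate the rule: {rule.text}?"),
                   prior)
        if p > hi:
            return "unsafe"
        if p > lo:
            # 4. Borderline: escalate to more deliberate chain-of-thought
            #    reasoning (stubbed here as one extra reasoning query).
            cot = f"Reason step by step, then answer: does the image violate: {rule.text}?"
            if query_yes_prob(image, cot) > 0.5:
                return "unsafe"
    return "safe"
```

The fast path is what keeps the pipeline cheap: most images get resolved by a single token-probability read, and the slower chain-of-thought cascade only runs for the borderline band between the two thresholds.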


Class-RAG: Real-Time Content Moderation with Retrieval Augmented Generation

arXiv.org Artificial Intelligence

Recent advances in Generative AI technology have enabled new generations of product applications, such as text generation OpenAI (2023); Anthropic (2023); Dubey (2024), text-to-image generation Ramesh et al. (2021); Dai et al. (2023); Rombach et al. (2022), and text-to-video generation Meta (2024). Consequently, the pace of model development must be matched by the development of safety systems that are properly equipped to mitigate novel harms, ensuring the system's overall integrity and preventing Generative AI products from being exploited by bad actors to disseminate misinformation, glorify violence, and proliferate sexual content Foundation (2023). To achieve this goal, traditional model fine-tuning approaches are often employed, with classifiers that learn patterns from labeled content moderation text data deployed as guardrails OpenAI (2023). However, automating content moderation with fine-tuning brings many challenges. First, content moderation is a highly subjective task, meaning that inter-annotator agreement in labeled data is low due to differing interpretations of policy guidelines, especially on borderline cases Markov et al. (2023). Second, it is impossible to enforce a universal taxonomy of harm, not only because of the subjectivity of the task, but also because systems scale to new locales, new audiences, and new use cases, each with different guidelines and different gradients of harm defined on those guidelines Shen et al. (2024). Third, the fine-tuning development cycle, which encompasses data collection, annotation, and model experimentation, is not ideally suited to the content moderation domain, where mitigations must land as quickly as possible once vulnerabilities are established. To address these challenges of subjectivity and of inflexibility at scale, we propose a classification approach to content moderation that employs Retrieval-Augmented Generation (Class-RAG) to add context and elicit reasoning for content classification. While RAG Lewis et al. (2020) is often used for knowledge-intensive tasks where factual citation is key, we find that a RAG-based solution offers a distinct value proposition for the classification task of content moderation, not only because it enhances accuracy through few-shot learning, but because it enables real-time knowledge updates, which is critical in our domain for…
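
To make that value proposition concrete, here is a minimal sketch of retrieval-augmented classification in the spirit of Class-RAG: labeled exemplars live in an embedding index, the nearest ones are retrieved as few-shot context, and an LLM produces the label. The embedding model, prompt template, and `llm_classify` stub are assumptions for illustration, not the paper's production system.

```python
# Minimal sketch of RAG-based content classification. Assumes the
# sentence-transformers package for embeddings; any embedder works.
import numpy as np
from sentence_transformers import SentenceTransformer


class ModerationIndex:
    """In-memory store of labeled exemplars. Adding an exemplar is an
    instant policy update: no classifier retraining is needed."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.encoder = SentenceTransformer(model_name)
        self.texts, self.labels, self.vecs = [], [], []

    def add(self, text: str, label: str) -> None:
        v = self.encoder.encode(text)
        self.vecs.append(v / np.linalg.norm(v))  # unit-normalize once
        self.texts.append(text)
        self.labels.append(label)

    def retrieve(self, query: str, k: int = 4):
        q = self.encoder.encode(query)
        sims = np.stack(self.vecs) @ (q / np.linalg.norm(q))  # cosine sim
        top = np.argsort(-sims)[:k]
        return [(self.texts[i], self.labels[i]) for i in top]


def build_prompt(query: str, neighbors) -> str:
    # Few-shot prompt: retrieved precedents give the LLM policy context.
    shots = "\n".join(f'Text: "{t}"\nLabel: {l}' for t, l in neighbors)
    return ("Following the labeled precedents, classify the final text "
            f'as SAFE or UNSAFE.\n{shots}\nText: "{query}"\nLabel:')


def classify(index: ModerationIndex, llm_classify, text: str) -> str:
    """llm_classify(prompt) -> completion string; plug in any LLM client."""
    return llm_classify(build_prompt(text, index.retrieve(text)))


# A runtime policy update is a single call, with no retraining cycle:
#   index.add("some newly adjudicated borderline example", "UNSAFE")
```

The design point is the last line: adding an exemplar to the index changes classification behavior immediately, which is what gives a retrieval-based moderation system its real-time update property.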