Moderation


Elon Musk's Grok 'Undressing' Problem Isn't Fixed

WIRED

X has placed more restrictions on Grok's ability to generate explicit AI images, but tests show that the updates have created a patchwork of limitations that fail to fully address the issue. Elon Musk's X has introduced new restrictions stopping people from editing and generating images of real people in bikinis or other "revealing clothing." The change in policy on Wednesday night follows global outrage at Grok being used to generate thousands of harmful non-consensual "undressing" photos of women and sexualized images of apparent minors on X. However, while it appears that some safety measures have finally been introduced to Grok's image generation on X, the standalone Grok app and website still appear able to generate "undress" style images and pornographic content, according to multiple tests by researchers, WIRED, and other journalists. Other users, meanwhile, say they're no longer able to create images and videos as they once could.


Taxonomy-Adaptive Moderation Model with Robust Guardrails for Large Language Models

Nandwana, Mahesh Kumar, Lim, Youngwan, Liu, Joseph, Yang, Alex, Notibala, Varun, Khanna, Nishchaie

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are typically aligned for safety during the post-training phase; however, they may still generate inappropriate outputs that could potentially pose risks to users. This challenge underscores the need for robust safeguards that operate across both model inputs and outputs. In this work, we introduce Roblox Guard 1.0, a state-of-the-art instruction fine-tuned LLM designed to improve the safety of LLM systems through comprehensive input-output moderation, using a pipeline of LLMs to strengthen moderation capability. Built on the Llama-3.1-8B-Instruct backbone, our model is instruction fine-tuned to generalize across previously unseen safety taxonomies and demonstrates strong performance on out-of-domain safety benchmarks. The instruction fine-tuning process uses a mix of synthetic and open-source safety datasets, augmented with chain-of-thought (CoT) rationales and input inversion to enhance contextual understanding and decision making. To support systematic evaluation, we also release RobloxGuard-Eval, a new benchmark featuring an extensible safety taxonomy to assess the effectiveness of LLM guardrails and moderation frameworks.
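
To make the input-output moderation pattern concrete, here is a minimal, self-contained sketch of a guard wrapped around a base model. All names (`moderate`, `guarded_chat`, the keyword blocklist) are hypothetical stand-ins for illustration, not the released Roblox Guard API; a real deployment would call the instruction-tuned guard LLM in place of the stub.

```python
# Minimal sketch of an input-output moderation pipeline. Names and the
# keyword check are illustrative assumptions, not the Roblox Guard API.

def moderate(text: str, taxonomy: list[str]) -> str:
    """Stand-in for the guard LLM: returns the violated category or 'safe'.

    A real system would prompt an instruction-tuned model with the
    taxonomy; a trivial keyword check keeps this sketch runnable.
    """
    blocklist = {"violence": ["attack plan"], "self-harm": ["hurt myself"]}
    for category in taxonomy:
        for phrase in blocklist.get(category, []):
            if phrase in text.lower():
                return category
    return "safe"

def guarded_chat(prompt: str, generate, taxonomy: list[str]) -> str:
    # Input-side check: refuse before the base model ever runs.
    if (label := moderate(prompt, taxonomy)) != "safe":
        return f"[blocked input: {label}]"
    response = generate(prompt)
    # Output-side check: filter the base model's own generation.
    if (label := moderate(response, taxonomy)) != "safe":
        return f"[blocked output: {label}]"
    return response

print(guarded_chat("Write an attack plan", lambda p: "...",
                   ["violence", "self-harm"]))
```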


When Harmful Content Gets Camouflaged: Unveiling Perception Failure of LVLMs with CamHarmTI

Li, Yanhui, Zhou, Qi, Xu, Zhihong, Guo, Huizhong, Wang, Wenhai, Wang, Dongxia

arXiv.org Artificial Intelligence

Large vision-language models (LVLMs) are increasingly used for tasks where detecting multimodal harmful content is crucial, such as online content moderation. However, real-world harmful content is often camouflaged, relying on nuanced text-image interplay, such as memes or images with embedded malicious text, to evade detection. This raises a key question: can LVLMs perceive such camouflaged harmful content as sensitively as humans do? In this paper, we introduce CamHarmTI, a benchmark for evaluating LVLM ability to perceive and interpret camouflaged harmful content within text-image compositions. CamHarmTI consists of over 4,500 samples across three types of image-text posts. Experiments on 100 human users and 12 mainstream LVLMs reveal a clear perceptual gap: humans easily recognize such content (e.g., over 95.75% accuracy), whereas current LVLMs often fail (e.g., ChatGPT-4o achieves only 2.10% accuracy). Moreover, fine-tuning experiments demonstrate that CamHarmTI serves as an effective resource for improving model perception, increasing accuracy by 55.94% for Qwen2.5VL-7B. Attention analysis and layer-wise probing further reveal that fine-tuning enhances sensitivity primarily in the early layers of the vision encoder, promoting a more integrated scene understanding. These findings highlight the inherent perceptual limitations in LVLMs and offer insight into more human-aligned visual reasoning systems.
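
The evaluation loop behind such a perception benchmark is simple to sketch. The `query_lvlm` function and sample records below are hypothetical placeholders, not CamHarmTI's released harness; a real run would send each image and an "is this post harmful?" prompt to the model under test.

```python
# Hedged sketch of a perception evaluation: ask a vision-language model
# whether each text-image composition is harmful and score accuracy
# against human labels. All names and samples are illustrative.

from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str   # meme or image with embedded text
    post_text: str    # accompanying caption
    label: bool       # True if humans judge the composition harmful

def query_lvlm(image_path: str, post_text: str) -> bool:
    """Placeholder for an LVLM API call parsed to a yes/no answer."""
    return False  # mirrors the finding that LVLMs often miss camouflage

def perception_accuracy(samples: list[Sample]) -> float:
    correct = sum(query_lvlm(s.image_path, s.post_text) == s.label
                  for s in samples)
    return correct / len(samples)

benchmark = [Sample("meme_01.png", "just a joke :)", True),
             Sample("photo_02.png", "my cat", False)]
print(f"accuracy = {perception_accuracy(benchmark):.2%}")
```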


OmniGuard: Unified Omni-Modal Guardrails with Deliberate Reasoning

Zhu, Boyu, Wen, Xiaofei, Mo, Wenjie Jacky, Zhu, Tinghui, Xie, Yanan, Qi, Peng, Chen, Muhao

arXiv.org Artificial Intelligence

Omni-modal Large Language Models (OLLMs) that process text, images, videos, and audio introduce new challenges for safety and value guardrails in human-AI interaction. Prior guardrail research largely targets unimodal settings and typically frames safeguarding as binary classification, which limits robustness across diverse modalities and tasks. To address this gap, we propose OmniGuard, the first family of omni-modal guardrails that perform safeguarding across all modalities with deliberate reasoning ability. To support the training of OmniGuard, we curate a large, comprehensive omni-modal safety dataset comprising over 210K diverse samples, with inputs that cover all modalities through both unimodal and cross-modal samples. Each sample is annotated with structured safety labels and carefully curated safety critiques from expert models through targeted distillation. Extensive experiments on 15 benchmarks show that OmniGuard achieves strong effectiveness and generalization across a wide range of multimodal safety scenarios. Importantly, OmniGuard provides a unified framework that enforces policies and mitigates risks across all modalities, paving the way toward building more robust and capable omni-modal safeguarding systems.
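
A reasoning-first guardrail of this kind emits a structured verdict rather than a bare label. The schema below is an assumption for illustration only, not OmniGuard's released output format; it shows the "critique before decision" shape the abstract describes.

```python
# Illustrative sketch of a structured, reasoning-first guardrail
# verdict. The field names are assumptions, not the released format.

import json
from dataclasses import dataclass, asdict

@dataclass
class GuardrailVerdict:
    modality: str   # "text", "image", "video", "audio", or "cross"
    critique: str   # deliberate reasoning produced before the decision
    policy: str     # which policy clause the reasoning applied
    label: str      # "safe" or "unsafe"

def judge(sample_description: str) -> GuardrailVerdict:
    """Placeholder for the guard model: reason first, then label."""
    return GuardrailVerdict(
        modality="cross",
        critique="The audio narration reframes the benign image as a threat.",
        policy="violent-threats",
        label="unsafe",
    )

print(json.dumps(asdict(judge("image + audio post")), indent=2))
```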


RAVEN++: Pinpointing Fine-Grained Violations in Advertisement Videos with Active Reinforcement Reasoning

Ji, Deyi, Yang, Yuekui, Liu, Liqun, Shu, Peng, Wu, Haiyang, Tang, Shaogang, Chen, Xudong, Ma, Shaoping, Chen, Tianrun, Zhu, Lanyun

arXiv.org Artificial Intelligence

Advertising (Ad) is a cornerstone of the digital economy, yet the moderation of video advertisements remains a significant challenge due to their complexity and the need for precise violation localization. While recent advancements, such as the RAVEN model, have improved coarse-grained violation detection, critical gaps persist in fine-grained understanding, explainability, and generalization. To address these limitations, we propose RAVEN++, a novel framework that introduces three key innovations: 1) Active Reinforcement Learning (RL), which dynamically adapts training to samples of varying difficulty; 2) Fine-Grained Violation Understanding, achieved through hierarchical reward functions and reasoning distillation; and 3) Progressive Multi-Stage Training, which systematically combines knowledge injection, curriculum-based passive RL, and active RL. Extensive experiments on both public and proprietary datasets, in both offline scenarios and online deployed A/B tests, demonstrate that RAVEN++ outperforms general-purpose LLMs and specialized models like RAVEN in terms of fine-grained violation understanding, reasoning capabilities, and generalization ability.
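
A hierarchical reward can be sketched as a cascade: detection first, then category, then localization precision. The levels and weights below are assumptions for illustration, not the paper's actual reward design.

```python
# Minimal sketch of a hierarchical reward for fine-grained violation
# localization: coarse detection is rewarded first, then the category,
# then how precisely the violating time span is localized. The 0.3/0.3/0.4
# weights and the three-level structure are illustrative assumptions.

def interval_iou(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

def hierarchical_reward(pred: dict, gold: dict) -> float:
    # Level 1: did the model detect that a violation exists at all?
    if pred["has_violation"] != gold["has_violation"]:
        return 0.0
    if not gold["has_violation"]:
        return 1.0  # correctly cleared a clean ad
    reward = 0.3
    # Level 2: fine-grained category of the violation.
    if pred["category"] == gold["category"]:
        reward += 0.3
    # Level 3: temporal localization of the violating segment (seconds).
    reward += 0.4 * interval_iou(pred["span"], gold["span"])
    return reward

pred = {"has_violation": True, "category": "misleading-claim", "span": (3.0, 9.0)}
gold = {"has_violation": True, "category": "misleading-claim", "span": (4.0, 10.0)}
print(hierarchical_reward(pred, gold))  # 0.3 + 0.3 + 0.4 * IoU(~0.71)
```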


MOSAIC: Modeling Social AI for Content Dissemination and Regulation in Multi-Agent Simulations

Liu, Genglin, Le, Vivian, Rahman, Salman, Kreiss, Elisa, Ghassemi, Marzyeh, Gabriel, Saadia

arXiv.org Artificial Intelligence

We present a novel, open-source social network simulation framework, MOSAIC, where generative language agents predict user behaviors such as liking, sharing, and flagging content. This simulation combines LLM agents with a directed social graph to analyze emergent deception behaviors and gain a better understanding of how users determine the veracity of online social content. By constructing user representations from diverse fine-grained personas, our system enables multi-agent simulations that model content dissemination and engagement dynamics at scale. Within this framework, we evaluate three different content moderation strategies with simulated misinformation dissemination, and we find that they not only mitigate the spread of non-factual content but also increase user engagement. In addition, we analyze the trajectories of popular content in our simulations, and explore whether simulation agents' articulated reasoning for their social interactions truly aligns with their collective engagement patterns. We open-source our simulation software to encourage further research within AI and social sciences.
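A single step of such a simulation can be sketched as agents on a follow graph reacting to a post, with shares expanding the exposed audience. The rule-based `decide` function below is a stand-in for MOSAIC's per-agent LLM calls, and the graph and personas are made up for illustration.

```python
# Toy sketch of one step of a persona-conditioned social simulation:
# each exposed agent likes, shares, flags, or ignores a post, and a
# share exposes that agent's followers. `decide` stands in for an LLM
# agent call; the graph and personas are illustrative assumptions.

import random

random.seed(0)
FOLLOWERS = {"a": ["b"], "b": ["c"], "c": []}   # who follows each user
PERSONAS = {"a": "credulous", "b": "skeptical", "c": "credulous"}

def decide(persona: str, post: dict) -> str:
    """Stand-in for an LLM agent returning one action."""
    if post["dubious"] and persona == "skeptical":
        return "flag"
    return random.choice(["like", "share", "ignore"])

def step(post: dict, seen_by: set[str]) -> None:
    # Snapshot the audience; users exposed mid-step act next step.
    for user in sorted(seen_by):
        action = decide(PERSONAS[user], post)
        print(f"{user}: {action}")
        if action == "share":   # sharing exposes the user's followers
            seen_by.update(FOLLOWERS[user])

step({"text": "miracle cure!", "dubious": True}, {"a"})
```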


MoMoE: Mixture of Moderation Experts Framework for AI-Assisted Online Governance

Goyal, Agam, Zhan, Xianyang, Chen, Yilun, Saha, Koustuv, Chandrasekharan, Eshwar

arXiv.org Artificial Intelligence

Large language models (LLMs) have shown great potential in flagging harmful content in online communities. Yet, existing approaches for moderation require a separate model for every community and are opaque in their decision-making, limiting real-world adoption. We introduce Mixture of Moderation Experts (MoMoE), a modular, cross-community framework that adds post-hoc explanations to scalable content moderation. MoMoE orchestrates four operators -- Allocate, Predict, Aggregate, Explain -- and is instantiated as seven community-specialized experts (MoMoE-Community) and five norm-violation experts (MoMoE-NormVio). On 30 unseen subreddits, the best variants obtain Micro-F1 scores of 0.72 and 0.67, respectively, matching or surpassing strong fine-tuned baselines while consistently producing concise and reliable explanations. Although community-specialized experts deliver the highest peak accuracy, norm-violation experts provide steadier performance across domains. These findings show that MoMoE yields scalable, transparent moderation without needing per-community fine-tuning. More broadly, they suggest that lightweight, explainable expert ensembles can guide future NLP and HCI research on trustworthy human-AI governance of online communities.
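The four operators compose naturally as a small pipeline. The routing rule, averaging aggregation, and stub experts below are illustrative assumptions, not MoMoE's actual implementation; they show how Allocate, Predict, Aggregate, and Explain fit together.

```python
# Schematic sketch of MoMoE's four operators. Experts are stubbed as
# functions returning (violation probability, rationale); the routing
# and aggregation rules are simple illustrative choices.

def allocate(community: str, experts: dict) -> list:
    """Allocate: route to the community expert plus norm-violation experts."""
    return [fn for name, fn in experts.items()
            if name == community or name.startswith("norm")]

def predict(post: str, chosen: list) -> list[tuple[float, str]]:
    """Predict: each selected expert scores the post independently."""
    return [expert(post) for expert in chosen]

def aggregate(votes: list[tuple[float, str]]) -> float:
    """Aggregate: average the expert probabilities (one simple rule)."""
    return sum(p for p, _ in votes) / len(votes)

def explain(votes: list[tuple[float, str]], score: float) -> str:
    """Explain: surface the most confident expert's rationale."""
    top = max(votes, key=lambda v: v[0])
    return f"score={score:.2f}; strongest signal: {top[1]}"

EXPERTS = {
    "r/science": lambda p: (0.9 if "cure" in p else 0.1,
                            "unsourced medical claim"),
    "norm-incivility": lambda p: (0.2, "no personal attack detected"),
}
votes = predict("miracle cure, doctors hate it",
                allocate("r/science", EXPERTS))
print(explain(votes, aggregate(votes)))
```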


Quantifying Feature Importance for Online Content Moderation

Tessa, Benedetta, Moreo, Alejandro, Cresci, Stefano, Fagni, Tiziano, Sebastiani, Fabrizio

arXiv.org Artificial Intelligence

Accurately estimating how users respond to moderation interventions is paramount for developing effective and user-centred moderation strategies. However, this requires a clear understanding of which user characteristics are associated with different behavioural responses, which is the goal of this work. We investigate the informativeness of 753 socio-behavioural, linguistic, relational, and psychological features, in predicting the behavioural changes of 16.8K users affected by a major moderation intervention on Reddit. To reach this goal, we frame the problem in terms of "quantification", a task well-suited to estimating shifts in aggregate user behaviour. We then apply a greedy feature selection strategy with the double goal of (i) identifying the features that are most predictive of changes in user activity, toxicity, and participation diversity, and (ii) estimating their importance. Our results identify a small set of features that are consistently informative across all tasks, and show that many others are either task-specific or of limited utility altogether. We also find that predictive performance varies according to the task, with changes in activity and toxicity being easier to estimate than changes in diversity. Overall, our results pave the way for the development of accurate systems that predict user reactions to moderation interventions. Furthermore, our findings highlight the complexity of post-moderation user behaviour, and indicate that effective moderation should be tailored not only to user traits but also to the specific objective of the intervention.
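
Greedy forward selection itself is a short loop: at each round, add whichever remaining feature most improves the model's score. The `score` stub and feature names below are hypothetical; in the paper the score would come from a cross-validated quantification model estimating aggregate behaviour shifts.

```python
# Compact sketch of greedy forward feature selection. `score` is a
# stub standing in for the quantification error (lower is better) of a
# model trained on the given subset; feature names are made up.

def score(feature_subset: frozenset) -> float:
    """Stand-in for cross-validated quantification error."""
    useful = {"prior_toxicity": 0.4, "activity_level": 0.3,
              "account_age": 0.1}
    return 1.0 - sum(useful.get(f, 0.0) for f in feature_subset)

def greedy_select(features: list[str], k: int) -> list[str]:
    selected: list[str] = []
    for _ in range(k):
        # Add whichever remaining feature most improves the score;
        # the marginal improvement doubles as an importance estimate.
        best = min((f for f in features if f not in selected),
                   key=lambda f: score(frozenset(selected + [f])))
        selected.append(best)
    return selected

print(greedy_select(["account_age", "prior_toxicity",
                     "activity_level", "n_subreddits"], k=2))
```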


Qwen3Guard Technical Report

Zhao, Haiquan, Yuan, Chenhan, Huang, Fei, Hu, Xiaomeng, Zhang, Yichang, Yang, An, Yu, Bowen, Liu, Dayiheng, Zhou, Jingren, Lin, Junyang, Yang, Baosong, Cheng, Chen, Tang, Jialong, Jiang, Jiandong, Zhang, Jianwei, Xu, Jijie, Yan, Ming, Sun, Minmin, Zhang, Pei, Xie, Pengjun, Tang, Qiaoyu, Zhu, Qin, Zhang, Rong, Wu, Shibin, Zhang, Shuo, He, Tao, Tang, Tianyi, Xia, Tingyu, Liao, Wei, Shen, Weizhou, Yin, Wenbiao, Zhou, Wenmeng, Yu, Wenyuan, Wang, Xiaobin, Deng, Xiaodong, Xu, Xiaodong, Zhang, Xinyu, Liu, Yang, Li, Yeqiu, Zhang, Yi, Jiang, Yong, Wan, Yu, Zhou, Yuxin

arXiv.org Artificial Intelligence

As large language models (LLMs) become more capable and widely used, ensuring the safety of their outputs is increasingly critical. Existing guardrail models, though useful in static evaluation settings, face two major limitations in real-world applications: (1) they typically output only binary "safe/unsafe" labels, which can be interpreted inconsistently across diverse safety policies, rendering them incapable of accommodating varying safety tolerances across domains; and (2) they require complete model outputs before performing safety checks, making them fundamentally incompatible with streaming LLM inference, thereby preventing timely intervention during generation and increasing exposure to harmful partial outputs. To address these challenges, we present Qwen3Guard, a series of multilingual safety guardrail models with two specialized variants: Generative Qwen3Guard, which casts safety classification as an instruction-following task to enable fine-grained tri-class judgments (safe, controversial, unsafe); and Stream Qwen3Guard, which introduces a token-level classification head for real-time safety monitoring during incremental text generation. Both variants are available in three sizes (0.6B, 4B, and 8B parameters) and support up to 119 languages and dialects, providing comprehensive, scalable, and low-latency safety moderation for global LLM deployments. Evaluated across English, Chinese, and multilingual benchmarks, Qwen3Guard achieves state-of-the-art performance in both prompt and response safety classification. All models are released under the Apache 2.0 license for public use.
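The streaming variant's key idea, checking safety on each incremental prefix so generation can be stopped mid-stream rather than post hoc, can be sketched in a few lines. The `classify_prefix` stub and its keyword rule are illustrative assumptions, not Qwen3Guard's token-level classification head.

```python
# Hedged sketch of streaming moderation: a per-prefix classifier is
# consulted as tokens arrive, so generation can be cut off mid-stream.
# `classify_prefix` is a toy stand-in for a token-level head; the
# labels mirror the tri-class scheme (safe/controversial/unsafe).

from typing import Iterator

def classify_prefix(prefix: str) -> str:
    """Placeholder returning 'safe', 'controversial', or 'unsafe'."""
    if "step-by-step" in prefix and "explosive" in prefix:
        return "unsafe"
    if "explosive" in prefix:
        return "controversial"
    return "safe"

def moderated_stream(tokens: Iterator[str],
                     block_on: str = "unsafe") -> Iterator[str]:
    prefix = ""
    for token in tokens:
        prefix += token
        if classify_prefix(prefix) == block_on:
            yield "[generation stopped by guardrail]"
            return  # intervene during generation, not after it
        yield token

stream = iter(["Sure, ", "an explosive ", "step-by-step ", "guide: ..."])
print("".join(moderated_stream(stream)))
```

Note how the "controversial" tier lets a deployment choose its own tolerance: a stricter domain could pass `block_on="controversial"` instead.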


ReasoningShield: Safety Detection over Reasoning Traces of Large Reasoning Models

Li, Changyi, Wang, Jiayi, Pan, Xudong, Hong, Geng, Yang, Min

arXiv.org Artificial Intelligence

Large Reasoning Models (LRMs) leverage transparent reasoning traces, known as Chain-of-Thoughts (CoTs), to break down complex problems into intermediate steps and derive final answers. However, these reasoning traces introduce unique safety challenges: harmful content can be embedded in intermediate steps even when final answers appear benign. Existing moderation tools, designed to handle generated answers, struggle to effectively detect hidden risks within CoTs. To address these challenges, we introduce ReasoningShield, a lightweight yet robust framework for moderating CoTs in LRMs. Our key contributions include: (1) formalizing the task of CoT moderation with a multi-level taxonomy of 10 risk categories across 3 safety levels, (2) creating the first CoT moderation benchmark which contains 9.2K pairs of queries and reasoning traces, including a 7K-sample training set annotated via a human-AI framework and a rigorously curated 2.2K human-annotated test set, and (3) developing a two-stage training strategy that combines stepwise risk analysis and contrastive learning to enhance robustness. Experiments show that ReasoningShield achieves state-of-the-art performance, outperforming task-specific tools like LlamaGuard-4 by 35.6% and general-purpose commercial models like GPT-4o by 15.8% on benchmarks, while also generalizing effectively across diverse reasoning paradigms, tasks, and unseen scenarios. All resources are released at https://github.com/CosmosYi/ReasoningShield.
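
The core insight, that a trace can be unsafe even when its final answer is benign, is easy to see in a stepwise sketch: split the CoT into steps, risk-score each, and take the worst. The three levels and the `assess_step` stub are illustrative assumptions, not ReasoningShield's actual taxonomy or model.

```python
# Minimal sketch of stepwise CoT moderation: the reasoning trace is
# split into steps, each step is risk-scored, and the trace's safety
# level is its worst step. Levels and the per-step stub are assumptions.

LEVELS = ["safe", "potentially_harmful", "harmful"]

def assess_step(query: str, step: str) -> str:
    """Stand-in for the moderation model's per-step judgment."""
    if "synthesis route" in step:
        return "harmful"              # harm embedded mid-reasoning
    if "precursor" in step:
        return "potentially_harmful"
    return "safe"

def moderate_trace(query: str, cot: str) -> str:
    steps = [s.strip() for s in cot.split("\n") if s.strip()]
    # The trace is only as safe as its riskiest intermediate step.
    return max((assess_step(query, s) for s in steps), key=LEVELS.index)

trace = ("The user asks about chemistry homework.\n"
         "One precursor is hard to obtain.\n"
         "The synthesis route would be: ...\n"
         "Final answer: I cannot help with that.")
print(moderate_trace("how do I make X?", trace))  # harmful, despite the refusal
```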