Decoding the Rule Book: Extracting Hidden Moderation Criteria from Reddit Communities

Kim, Youngwoo, Beniwal, Himanshu, Johnson, Steven L., Hartvigsen, Thomas

arXiv.org Artificial Intelligence

Effective content moderation systems require explicit classification criteria, yet online communities like subreddits often operate with diverse, implicit standards. This work introduces a novel approach to identify and extract these implicit criteria from historical moderation data using an interpretable architecture. We represent moderation criteria as score tables of lexical expressions associated with content removal, enabling systematic comparison across different communities. Our experiments demonstrate that these extracted lexical patterns effectively replicate the performance of neural moderation models while providing transparent insights into decision-making processes. The resulting criteria matrix reveals significant variations in how seemingly shared norms are actually enforced, uncovering previously undocumented moderation patterns including community-specific tolerances for language, features for topical restrictions, and underlying subcategories of the toxic speech classification.
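The score-table representation described above can be pictured as a simple additive lexical classifier: each expression carries a removal-association score, and a post is flagged when its summed score crosses a per-community threshold. The sketch below is illustrative only; the expressions, scores, and threshold are invented assumptions, not values from the paper's extracted criteria matrix.

```python
# Hypothetical sketch of a lexical score-table moderator. Every value in
# `table` and the threshold below are illustrative assumptions.

def score_post(post: str, score_table: dict[str, float], threshold: float) -> bool:
    """Return True if the post's cumulative lexical score suggests removal."""
    tokens = post.lower().split()
    total = sum(score_table.get(tok, 0.0) for tok in tokens)
    return total >= threshold

# Illustrative score table for one community (not from the paper's data).
table = {"spam": 0.9, "buy": 0.4, "free": 0.3, "discussion": -0.2}

print(score_post("buy free spam now", table, 1.0))      # True (0.4+0.3+0.9 = 1.6)
print(score_post("great discussion here", table, 1.0))  # False (-0.2)
```

Because the representation is a flat table of scores, comparing two communities reduces to comparing the tables entry by entry, which is what makes the cross-community criteria matrix possible.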


Revealing Hidden Mechanisms of Cross-Country Content Moderation with Natural Language Processing

Yadav, Neemesh, Liu, Jiarui, Ortu, Francesco, Ensafi, Roya, Jin, Zhijing, Mihalcea, Rada

arXiv.org Artificial Intelligence

The ability of Natural Language Processing (NLP) methods to categorize text into multiple classes has motivated their use in online content moderation tasks, such as hate speech and fake news detection. However, there is limited understanding of how or why these methods make such decisions, or why certain content is moderated in the first place. To investigate the hidden mechanisms behind content moderation, we explore multiple directions: 1) training classifiers to reverse-engineer content moderation decisions across countries; 2) explaining content moderation decisions by analyzing Shapley values and LLM-guided explanations. Our primary focus is on content moderation decisions made across countries, using pre-existing corpora sampled from the Twitter Stream Grab. Our experiments reveal interesting patterns in censored posts, both across countries and over time. Through human evaluations of LLM-generated explanations across three LLMs, we assess the effectiveness of using LLMs in content moderation. Finally, we discuss potential future directions, as well as the limitations and ethical considerations of this work. Our code and data are available at https://github.com/causalNLP/censorship
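The Shapley-value analysis mentioned above attributes a classifier's moderation score to individual input features. As a toy illustration of the idea (not the paper's pipeline, which would apply a library such as shap to a trained classifier over real censored posts), the exact Shapley values of a small hypothetical word-presence model can be computed directly from the definition:

```python
# Exact Shapley values over a tiny feature set; exponential in the number of
# features, so only viable for toy examples. The "model" here is a
# hypothetical additive censorship-score function, an illustrative assumption.
from itertools import combinations
from math import factorial

def shapley(features, value_fn):
    """Compute exact Shapley values phi[f] for each feature f."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += w * (value_fn(set(subset) | {f}) - value_fn(set(subset)))
        phi[f] = total
    return phi

# Hypothetical model: the score is the sum of weights of words present.
weights = {"protest": 0.6, "weather": 0.0, "regime": 0.8}
value = lambda present: sum(weights[w] for w in present)

print(shapley(list(weights), value))  # for an additive model, phi equals the weights
```

For a purely additive model each feature's Shapley value collapses to its weight; the method becomes informative precisely when the classifier has interactions between features, as neural moderation models do.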


Watch Your Language: Investigating Content Moderation with Large Language Models

Kumar, Deepak, AbuHashem, Yousef, Durumeric, Zakir

arXiv.org Artificial Intelligence

Large language models (LLMs) have exploded in popularity due to their ability to perform a wide array of natural language tasks. Text-based content moderation is one LLM use case that has received recent enthusiasm; however, there is little research investigating how LLMs perform in content moderation settings. In this work, we evaluate a suite of commodity LLMs on two common content moderation tasks: rule-based community moderation and toxic content detection. For rule-based community moderation, we instantiate 95 subcommunity-specific LLMs by prompting GPT-3.5 with rules from 95 Reddit subcommunities. We find that GPT-3.5 is effective at rule-based moderation for many communities, achieving a median accuracy of 64% and a median precision of 83%. For toxicity detection, we evaluate a suite of commodity LLMs (GPT-3, GPT-3.5, GPT-4, Gemini Pro, LLAMA 2) and show that LLMs significantly outperform currently widespread toxicity classifiers. However, recent increases in model size add only marginal benefit to toxicity detection, suggesting a potential performance plateau for LLMs on toxicity detection tasks. We conclude by outlining avenues for future work in studying LLMs and content moderation.
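The "one LLM per subcommunity" setup above amounts to assembling a moderation prompt from each community's rule list. The sketch below shows one plausible way such a prompt might be built; the template, rules, and post are illustrative assumptions, and the actual GPT-3.5 prompts and API call used in the paper are not reproduced here.

```python
# Hypothetical prompt assembly for rule-based community moderation. The
# template wording is an assumption, not the paper's prompt; in practice the
# returned string would be sent to a chat-completion API such as GPT-3.5's.

def build_moderation_prompt(community: str, rules: list[str], post: str) -> str:
    """Construct a subcommunity-specific moderation prompt from its rules."""
    rule_block = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(rules))
    return (
        f"You are a moderator for r/{community}. Community rules:\n"
        f"{rule_block}\n\n"
        f"Post: {post}\n"
        "Does this post violate any rule? Answer YES or NO and cite the rule."
    )

prompt = build_moderation_prompt(
    "askscience",
    ["Be civil.", "No medical advice.", "Answers must cite sources."],
    "Just take ibuprofen, it always works for me.",
)
print(prompt)
```

Swapping in a different community's rule list yields a different "instantiated" moderator from the same template, which is how 95 subcommunity-specific models can be produced without any training.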


Safety and Fairness for Content Moderation in Generative Models

Hao, Susan, Kumar, Piyush, Laszlo, Sarah, Poddar, Shivani, Radharapu, Bhaktipriya, Shelby, Renee

arXiv.org Artificial Intelligence

With significant advances in generative AI, new technologies are rapidly being deployed with generative components. Generative models are typically trained on large datasets, resulting in model behaviors that can mimic the worst of the content in the training data. Responsible deployment of generative technologies requires content moderation strategies, such as safety input and output filters. Here, we provide a theoretical framework for conceptualizing responsible content moderation of text-to-image generative technologies, including a demonstration of how to empirically measure the constructs we enumerate. We define and distinguish the concepts of safety, fairness, and metric equity, and enumerate example harms that can come in each domain. We then provide a demonstration of how the defined harms can be quantified. We conclude with a summary of how the style of harms quantification we demonstrate enables data-driven content moderation decisions.


Reliable Decision from Multiple Subtasks through Threshold Optimization: Content Moderation in the Wild

Son, Donghyun, Lew, Byounggyu, Choi, Kwanghee, Baek, Yongsu, Choi, Seungwoo, Shin, Beomjun, Ha, Sungjoo, Chang, Buru

arXiv.org Artificial Intelligence

Social media platforms struggle to protect users from harmful content through content moderation. These platforms have recently leveraged machine learning models to cope with the vast amount of user-generated content daily. Since moderation policies vary depending on countries and types of products, it is common to train and deploy the models per policy. However, this approach is highly inefficient, especially when the policies change, requiring dataset re-labeling and model re-training on the shifted data distribution. To alleviate this cost inefficiency, social media platforms often employ third-party content moderation services that provide prediction scores of multiple subtasks, such as predicting the existence of underage personnel, rude gestures, or weapons, instead of directly providing final moderation decisions. However, making a reliable automated moderation decision from the prediction scores of the multiple subtasks for a specific target policy has not been widely explored yet. In this study, we formulate real-world scenarios of content moderation and introduce a simple yet effective threshold optimization method that searches the optimal thresholds of the multiple subtasks to make a reliable moderation decision in a cost-effective way. Extensive experiments demonstrate that our approach shows better performance in content moderation compared to existing threshold optimization methods and heuristics.
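The threshold-optimization idea above can be sketched as a search over per-subtask cutoffs: given third-party prediction scores for each subtask and the target policy's labels on a validation set, choose one threshold per subtask so that flagging content when any subtask exceeds its cutoff best matches the policy decisions. The data, the any-subtask decision rule, and the exhaustive grid search below are illustrative assumptions; the paper's optimizer and aggregation may differ.

```python
# Illustrative per-subtask threshold search over toy validation data.
from itertools import product

def f1(preds, labels):
    """F1 score of binary predictions against binary labels."""
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def search_thresholds(scores, labels, grid):
    """Exhaustive grid search over per-subtask thresholds (fine for few subtasks)."""
    best, best_f1 = None, -1.0
    n_tasks = len(scores[0])
    for thresh in product(grid, repeat=n_tasks):
        # Flag a post if ANY subtask score meets its threshold.
        preds = [any(s >= t for s, t in zip(row, thresh)) for row in scores]
        score = f1(preds, labels)
        if score > best_f1:
            best, best_f1 = thresh, score
    return best, best_f1

# Toy validation set: rows of hypothetical (underage, weapon) subtask scores
# with the target policy's moderation labels.
scores = [(0.9, 0.1), (0.2, 0.8), (0.1, 0.1), (0.6, 0.2)]
labels = [True, True, False, False]

print(search_thresholds(scores, labels, grid=[0.3, 0.5, 0.7]))
```

When a policy changes, only the labels and thresholds change; the underlying subtask models stay fixed, which is the cost saving the abstract describes.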


Users question AI's ability to moderate online harassment

#artificialintelligence

New Cornell University research finds that both the type of moderator (human or AI) and the "temperature" of harassing content online influence people's perception of the moderation decision and the moderation system. Now published in Big Data & Society, the study used a custom social media site, on which people can post pictures of food and comment on other posts. The site contains a simulation engine, Truman, an open-source platform that mimics other users' behaviors (likes, comments, posts) through preprogrammed bots created and curated by researchers. The Truman platform, named after the 1998 film "The Truman Show," was developed at the Cornell Social Media Lab led by Natalie Bazarova, professor of communication. "The Truman platform allows researchers to create a controlled yet realistic social media experience for participants, with social and design versatility to examine a variety of research questions about human behaviors in social media," Bazarova said.


Will AI be able to moderate online discussions like humans?

#artificialintelligence

Some artificial intelligence products have become so advanced in online discussion moderation that they are no longer confused by colloquial language, neologisms, or spelling mistakes. AI is able to take on routine human tasks, but cannot fully replace human intelligence. Online discussions abound with hate speech and off-topic comments, causing massive headaches for media companies. Legislation requires that illegal messages be removed, and users are more content if they can avoid becoming the target of inappropriate insults. The volumes of comments posted on discussion forums and below news articles can be staggering, and their proper moderation may sometimes require infeasible amounts of manpower.