
 human moderator


Question the Questions: Auditing Representation in Online Deliberative Processes

arXiv.org Artificial Intelligence

A central feature of many deliberative processes, such as citizens' assemblies and deliberative polls, is the opportunity for participants to engage directly with experts. While participants are typically invited to propose questions for expert panels, only a limited number can be selected due to time constraints. This raises the challenge of how to choose a small set of questions that best represent the interests of all participants. We introduce an auditing framework for measuring the level of representation provided by a slate of questions, based on the social choice concept known as justified representation (JR). We present the first algorithms for auditing JR in the general utility setting, with our most efficient algorithm achieving a runtime of $O(mn\log n)$, where $n$ is the number of participants and $m$ is the number of proposed questions. We apply our auditing methods to historical deliberations, comparing the representativeness of (a) the actual questions posed to the expert panel (chosen by a moderator), (b) participants' questions chosen via integer linear programming, and (c) summary questions generated by large language models (LLMs). Our results highlight both the promise and current limitations of LLMs in supporting deliberative processes. By integrating our methods into an online deliberation platform that has been used for hundreds of deliberations across more than 50 countries, we make it easy for practitioners to audit and improve representation in future deliberations.
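As a rough illustration of what a JR audit checks, the sketch below covers only the simpler approval-ballot setting, not the general utility setting or the $O(mn\log n)$ algorithm the abstract describes. It assumes each participant submits a set of approved questions; the function name `jr_violations` and the sample data are hypothetical. A slate of size $k$ fails JR if at least $n/k$ participants all approve some common question while none of them approves anything on the slate.

```python
def jr_violations(approvals, slate, k=None):
    """Return the questions that witness a JR violation (approval ballots only).

    approvals: list of sets, one set of approved question ids per participant.
    slate:     iterable of selected question ids.
    k:         slate size used for the n/k threshold (defaults to len(slate)).
    """
    slate = set(slate)
    n = len(approvals)
    k = len(slate) if k is None else k
    if k <= 0:
        raise ValueError("slate size k must be positive")

    # Participants who approve nothing on the slate are unrepresented.
    unrepresented = [ballot for ballot in approvals if not (ballot & slate)]

    # Count, for every question, how many unrepresented participants approve it.
    counts = {}
    for ballot in unrepresented:
        for question in ballot:
            counts[question] = counts.get(question, 0) + 1

    # A question backed by at least n/k unrepresented participants is a witness.
    threshold = n / k
    return sorted(q for q, c in counts.items() if c >= threshold)


# Hypothetical example: four participants, slate of two questions.
ballots = [{"a"}, {"a"}, {"b"}, {"c"}]
print(jr_violations(ballots, ["b", "c"]))  # question "a" is a witness
print(jr_violations(ballots, ["a", "b"]))  # no witness, JR holds
```

This check runs in time linear in the total size of the ballots; it says nothing about how far a slate is from satisfying JR, which is the kind of graded measure an auditing framework would report.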


RedHerring Attack: Testing the Reliability of Attack Detection

arXiv.org Artificial Intelligence

In response to adversarial text attacks, attack detection models have been proposed and shown to successfully identify text modified by adversaries. Attack detection models can be leveraged to provide an additional check for NLP models and give signals for human input. However, the reliability of these models has not yet been thoroughly explored. Thus, we propose and test a novel attack setting and attack, RedHerring. RedHerring aims to make attack detection models unreliable by modifying a text to cause the detection model to predict an attack, while keeping the classifier correct. This creates a tension between the classifier and detector. If a human sees that the detector is giving an "incorrect" prediction while the classifier gives a correct one, then the human will see the detector as unreliable. We test this novel threat model on 4 datasets against 3 detectors defending 4 classifiers. We find that RedHerring is able to drop detection accuracy by 20 to 71 points, while maintaining (or improving) classifier accuracy. As an initial defense, we propose a simple confidence check which requires no retraining of the classifier or detector and increases detection accuracy greatly. This novel threat model offers new insights into how adversaries may target detection models.


Scalable Evaluation of Online Moderation Strategies via Synthetic Simulations

arXiv.org Artificial Intelligence

Despite the ever-growing importance of online moderation, there has been no large-scale study evaluating the effectiveness of alternative moderation strategies. This is largely due to the lack of appropriate datasets, and the difficulty of getting human discussants, moderators, and evaluators involved in multiple experiments. In this paper, we propose a methodology for leveraging synthetic experiments performed exclusively by Large Language Models (LLMs) to initially bypass the need for human participation in experiments involving online moderation. We evaluate six LLM moderation configurations: two currently used real-life moderation strategies (guidelines issued for human moderators for online moderation and real-life facilitation), two baseline strategies (guidelines elicited for LLM alignment work, and LLM moderation with minimal prompting), a baseline with no moderator at all, as well as our own proposed strategy inspired by a Reinforcement Learning (RL) formulation of the problem. We find that our own moderation strategy significantly outperforms established moderation guidelines, as well as out-of-the-box LLM moderation. We also find that smaller LLMs, with less intensive instruction-tuning, can create more varied discussions than larger models. In order to run these experiments, we create and release an efficient, purpose-built, open-source Python framework, dubbed "SynDisco", to easily simulate hundreds of discussions using LLM user-agents and moderators. Additionally, we release the Virtual Moderation Dataset (VMD), a large dataset of LLM-generated and LLM-annotated discussions, generated by three families of open-source LLMs, accompanied by an exploratory analysis of the dataset.


SLM-Mod: Small Language Models Surpass LLMs at Content Moderation

arXiv.org Artificial Intelligence

Large language models (LLMs) have shown promise in many natural language understanding tasks, including content moderation. However, these models can be expensive to query in real time and do not allow for a community-specific approach to content moderation. To address these challenges, we explore the use of open-source small language models (SLMs) for community-specific content moderation tasks. We fine-tune and evaluate SLMs (fewer than 15B parameters) by comparing their performance against much larger open- and closed-source models. Using 150K comments from 15 popular Reddit communities, we find that SLMs outperform LLMs at content moderation, with 11.5% higher accuracy and 25.7% higher recall on average across all communities. We further show the promise of cross-community content moderation, which has implications for new communities and the development of cross-platform moderation techniques. Finally, we outline directions for future work on language-model-based content moderation. Code and links to HuggingFace models can be found at https://github.com/AGoyal0512/SLM-Mod.


Large Language Models for Automatic Detection of Sensitive Topics

arXiv.org Artificial Intelligence

Sensitive information detection is crucial in content moderation to maintain safe online communities. Assisting in this traditionally manual process could relieve human moderators from overwhelming and tedious tasks, allowing them to focus solely on flagged content that may pose potential risks. Rapidly advancing large language models (LLMs) are known for their capability to understand and process natural language and so present a potential solution to support this process. This study explores the capabilities of five LLMs for detecting sensitive messages in the mental well-being domain within two online datasets and assesses their performance in terms of accuracy, precision, recall, F1 scores, and consistency. Our findings indicate that LLMs have the potential to be integrated into the moderation workflow as a convenient and precise detection tool. The best-performing model, GPT-4o, achieved an average accuracy of 99.5% and an F1-score of 0.99. We discuss the advantages and potential challenges of using LLMs in the moderation workflow and suggest that future research should address the ethical considerations of utilising this technology.


Vicarious Offense and Noise Audit of Offensive Speech Classifiers: Unifying Human and Machine Disagreement on What is Offensive

arXiv.org Artificial Intelligence

Offensive speech detection is a key component of content moderation. However, what is offensive can be highly subjective. This paper investigates how machine and human moderators disagree on what is offensive when it comes to real-world social web political discourse. We show that (1) there is extensive disagreement among the moderators (humans and machines); and (2) human and large-language-model classifiers are unable to predict how other human raters will respond, based on their political leanings. For (1), we conduct a noise audit at an unprecedented scale that combines both machine and human responses. For (2), we introduce a first-of-its-kind dataset of vicarious offense. Our noise audit reveals that moderation outcomes vary wildly across different machine moderators. Our experiments with human moderators suggest that political leanings combined with sensitive issues affect both first-person and vicarious offense. The dataset is available through https://github.com/Homan-Lab/voiced.


OpenAI is using GPT-4 to build an AI-powered content moderation system

Engadget

Content moderation has been one of the thorniest issues on the internet for decades. It's a difficult subject matter for anyone to tackle, considering the subjectivity that goes hand-in-hand with figuring out what content should be permissible on a given platform. ChatGPT maker OpenAI thinks it can help and it has been putting GPT-4's content moderation skills to the test. It's using the large multimodal model "to build a content moderation system that is scalable, consistent and customizable." The company wrote in a blog post that GPT-4 can not only help make content moderation decisions, but aid in developing policies and swiftly iterating on policy changes, "reducing the cycle from months to hours."


AI moderation will cause more harm than good

#artificialintelligence

Creating a game with a large, highly engaged online player base and an active community is, for many companies, right at the top of their wishlist. When they're really well managed, these games are a license to print money, to the extent that a single game can become a primary commercial driver of a pretty large company. Games like Fortnite, World of Warcraft, Call of Duty, Grand Theft Auto V, and Final Fantasy XIV, to name but a few, have become central to the ongoing success of the publishers who created and operate them. Their importance rests on the fact that while many popular franchises can rely on a huge launch for each new instalment, these games never actually stop being played and making money. It's no wonder that executives around the industry get dollar signs in their eyes when anyone starts talking about service-based games with high engagement. There are, of course, downsides.


How AI Is Moderating Online Content

#artificialintelligence

AI can help flag harmful or offensive content faster and more effectively! Whether it be by posting a photo on Instagram or writing a blog post, we're all adding more information to the internet. With over 4.62 billion people using social media, there are bound to be some bad eggs creating harmful or deceitful content. To make sure that users are exposed to as little bad content as possible, websites practice content moderation. Content moderation is the process of regulating and monitoring user-generated content based on a set of pre-existing rules and guidelines.


AI is not smart enough to solve Meta's content-policing problems, whistleblowers say

#artificialintelligence

Artificial intelligence is nowhere near good enough to address problems facing content moderation on Facebook, according to whistleblower Frances Haugen. Haugen appeared at an event in London Tuesday evening with Daniel Motaung, a former Facebook moderator who is suing the company in Kenya accusing it of human trafficking. Meta has praised the efficacy of its AI systems in the past. CEO Mark Zuckerberg told a Congressional hearing in March 2021 the company relies on AI to weed out over 95% of "hate speech content." In February this year Zuckerberg said the company wants to get its AI to a "human level" of intelligence.