Content Moderator


Is the AI boom finally starting to slow down?

The Guardian

Drive down the 280 freeway in San Francisco and you might believe AI is everywhere, and everything. Nearly every billboard advertises an AI-related product: "We've Automated 2,412 BDRs." "All that AI and still no ROI?" "Cheap on-demand GPU clusters." It's hard to know if you're interpreting the industry jargon correctly while zooming past in your vehicle. The signs are just one example of the tech industry's en masse pivot to AI, a technology that the executives who have the most to gain from it say will be universe-shifting, inevitable and unavoidable. In California's tech heartland, every company is now an AI company, just like every company became a tech company sometime in the 2010s.


When Testing AI Tests Us: Safeguarding Mental Health on the Digital Frontlines

Pendse, Sachin R., Gergle, Darren, Kornfield, Rachel, Meyerhoff, Jonah, Mohr, David, Suh, Jina, Wescott, Annie, Williams, Casey, Schleider, Jessica

arXiv.org Artificial Intelligence

Red-teaming is a core part of the infrastructure that ensures that AI models do not produce harmful content. Unlike past technologies, the black-box nature of generative AI systems necessitates a uniquely interactional mode of testing, one in which individuals on red teams actively interact with the system, leveraging natural language to simulate malicious actors and solicit harmful outputs. This interactional labor done by red teams can result in mental health harms that are uniquely tied to the adversarial engagement strategies necessary to effectively red team. The importance of ensuring that generative AI models do not propagate societal or individual harm is widely recognized -- a less visible foundation of end-to-end AI safety is the protection of the mental health and wellbeing of those who work to keep model outputs safe. In this paper, we argue that the unmet mental health needs of AI red-teamers are a critical workplace safety concern. Through analyzing the unique mental health impacts associated with the labor done by red teams, we propose potential individual and organizational strategies that could be used to meet these needs and safeguard the mental health of red-teamers. We develop our proposed strategies by drawing parallels between common red-teaming practices and interactional labor common to other professions (including actors, mental health professionals, conflict photographers, and content moderators), describing how individuals and organizations within these professional spaces safeguard their mental health given similar psychological demands. Drawing on these protective practices, we describe how safeguards could be adapted for the distinct mental health challenges experienced by red-teaming organizations as they mitigate emerging technological risks on the new digital frontlines.


Effective Black-Box Multi-Faceted Attacks Breach Vision Large Language Model Guardrails

Yang, Yijun, Wang, Lichao, Yang, Xiao, Hong, Lanqing, Zhu, Jun

arXiv.org Artificial Intelligence

Vision Large Language Models (VLLMs) integrate visual data processing, expanding their real-world applications, but also increasing the risk of generating unsafe responses. In response, leading companies have implemented Multi-Layered safety defenses, including alignment training, safety system prompts, and content moderation. However, their effectiveness against sophisticated adversarial attacks remains largely unexplored. In this paper, we propose MultiFaceted Attack, a novel attack framework designed to systematically bypass Multi-Layered Defenses in VLLMs. It comprises three complementary attack facets: Visual Attack that exploits the multimodal nature of VLLMs to inject toxic system prompts through images; Alignment Breaking Attack that manipulates the model's alignment mechanism to prioritize the generation of contrasting responses; and Adversarial Signature that deceives content moderators by strategically placing misleading information at the end of the response. Extensive evaluations on eight commercial VLLMs in a black-box setting demonstrate that MultiFaceted Attack achieves a 61.56% attack success rate, surpassing state-of-the-art methods by at least 42.18%.


Towards Conceptualization of "Fair Explanation": Disparate Impacts of anti-Asian Hate Speech Explanations on Content Moderators

Nguyen, Tin, Xu, Jiannan, Roy, Aayushi, Daumé, Hal III, Carpuat, Marine

arXiv.org Artificial Intelligence

Recent research at the intersection of AI explainability and fairness has focused on how explanations can improve human+AI task performance as assessed by fairness measures. We propose to characterize what constitutes an explanation that is itself "fair" -- an explanation that does not adversely impact specific populations. We formulate a novel evaluation method for "fair explanations" using not just accuracy and labeling time, but also the psychological impact of explanations on different user groups across several metrics (mental discomfort, stereotype activation, and perceived workload). We apply this method in the context of content moderation of potential hate speech, and its differential impact on Asian vs. non-Asian proxy moderators, across explanation approaches (saliency maps and counterfactual explanations). We find that saliency maps generally perform better and show less evidence of disparate impact (group unfairness) and individual unfairness than counterfactual explanations. Content warning: This paper contains examples of hate speech and racially discriminatory language. The authors do not support such content. Please consider your risk of discomfort carefully before continuing reading!
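To make the saliency-map side of that comparison concrete, here is a minimal sketch of gradient-based token saliency for a toxicity classifier. The model name, the gradient-times-input heuristic, and the scoring choices are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch: gradient-based token saliency for a toxicity classifier.
# The model name and scoring choices are assumptions for illustration.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "unitary/toxic-bert"  # assumed off-the-shelf toxicity model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def token_saliency(text: str):
    """Return (token, saliency) pairs using the gradient-x-input heuristic."""
    enc = tokenizer(text, return_tensors="pt")
    # Embed the tokens ourselves so gradients can be taken w.r.t. the embeddings.
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach()
    embeds.requires_grad_(True)
    out = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"])
    # Backpropagate the score of the highest-scoring class.
    out.logits.max(dim=-1).values.sum().backward()
    # Saliency per token: norm of (gradient * embedding).
    scores = (embeds.grad * embeds).norm(dim=-1).squeeze(0)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"].squeeze(0))
    return list(zip(tokens, scores.tolist()))

for tok, score in token_saliency("Example comment a moderator might review."):
    print(f"{tok}\t{score:.3f}")
```

A counterfactual explanation, by contrast, would show the moderator a minimally edited version of the text whose predicted label flips, rather than per-token importance scores.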


'It's destroyed me completely': Kenyan moderators decry toll of training of AI models

The Guardian

The images pop up in Mophat Okinyi's mind when he's alone, or when he's about to sleep. Okinyi, a former content moderator for OpenAI's ChatGPT in Nairobi, Kenya, is one of four people in that role who have filed a petition to the Kenyan government calling for an investigation into what they describe as exploitative conditions for contractors reviewing the content that powers artificial intelligence programs. "It has really damaged my mental health," said Okinyi. The 27-year-old said he would view up to 700 text passages a day, many depicting graphic sexual violence. He recalls that he started avoiding people after having read texts about rapists, and found himself projecting paranoid narratives onto people around him.


Hybrid moderation in the newsroom: Recommending featured posts to content moderators

Waterschoot, Cedric, Bosch, Antal van den

arXiv.org Artificial Intelligence

Online news outlets are grappling with the moderation of user-generated content within their comment sections. We present a recommender system based on ranking class probabilities to support and empower the moderator in choosing featured posts, a time-consuming task. By combining user and textual content features we obtain an optimal classification F1-score of 0.44 on the test set. Furthermore, we observe an optimal mean NDCG@5 of 0.87 on a large set of validation articles. As an expert evaluation, content moderators assessed the output for a random selection of articles by choosing comments to feature based on the recommendations, which resulted in an NDCG score of 0.83. We conclude, first, that adding text features yields the best score, and second, that while choosing featured content remains somewhat subjective, content moderators found suitable comments in all but one of the evaluated recommendations. We end the paper by analyzing our best-performing model, a step towards transparency and explainability in hybrid content moderation.
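For a sense of the ranking-and-evaluation idea quoted above, the sketch below ranks comments by a classifier's "feature-worthy" probability and scores the ranking with NDCG@5. The probability values and editor judgements are made-up illustrations, not data from the paper.

```python
# Minimal sketch: rank comments by predicted probability of being
# feature-worthy, then evaluate the ranking with NDCG@5.
import math

def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the top-k items of a ranking."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(predicted_scores, true_relevance, k=5):
    """NDCG@k: DCG of the predicted ranking divided by the ideal DCG."""
    order = sorted(range(len(predicted_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked_rel = [true_relevance[i] for i in order]
    ideal = dcg_at_k(sorted(true_relevance, reverse=True), k)
    return dcg_at_k(ranked_rel, k) / ideal if ideal > 0 else 0.0

# Hypothetical example: model probabilities that each comment deserves to be
# featured, and binary editor judgements of which comments actually were.
probs    = [0.91, 0.15, 0.78, 0.40, 0.66, 0.05]
featured = [1,    0,    0,    1,    1,    0]
print(f"NDCG@5 = {ndcg_at_k(probs, featured, k=5):.2f}")
```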


BiasX: "Thinking Slow" in Toxic Content Moderation with Explanations of Implied Social Biases

Zhang, Yiming, Nanduri, Sravani, Jiang, Liwei, Wu, Tongshuang, Sap, Maarten

arXiv.org Artificial Intelligence

Toxicity annotators and content moderators often default to mental shortcuts when making decisions. This can lead to subtle toxicity being missed, and to seemingly toxic but harmless content being over-detected. We introduce BiasX, a framework that enhances content moderation setups with free-text explanations of statements' implied social biases, and explore its effectiveness through a large-scale crowdsourced user study. We show that participants indeed benefit substantially from explanations when identifying subtly (non-)toxic content. The quality of explanations is critical: imperfect machine-generated explanations (+2.4% on hard toxic examples) help less than expert-written human explanations (+7.2%). Our results showcase the promise of using free-text explanations to encourage more thoughtful toxicity moderation.
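As a rough sketch of how machine-generated implied-bias explanations of the kind compared above might be produced, the snippet below prompts a chat model for a one-to-two-sentence explanation. The prompt wording and model choice are assumptions for illustration, not the BiasX pipeline itself.

```python
# Minimal sketch: generate a free-text "implied bias" explanation for a
# statement; prompt and model are illustrative assumptions, not BiasX.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def implied_bias_explanation(statement: str) -> str:
    """Ask a chat model to spell out the social bias a statement implies, if any."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any chat model would do
        messages=[
            {"role": "system",
             "content": ("You help content moderators think slowly. Given a "
                         "statement, explain in one or two sentences what "
                         "social bias or stereotype it implies, or say that "
                         "it implies none.")},
            {"role": "user", "content": statement},
        ],
    )
    return response.choices[0].message.content

# A moderator would see the statement alongside this explanation before labeling.
print(implied_bias_explanation("They only got the job because of a quota."))
```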


150 African Workers for ChatGPT, TikTok and Facebook Vote to Unionize at Landmark Nairobi Meeting

TIME - Tech

More than 150 workers whose labor underpins the AI systems of Facebook, TikTok and ChatGPT gathered in Nairobi on Monday and pledged to establish the first African Content Moderators Union, in a move that could have significant consequences for the businesses of some of the world's biggest tech companies. The current and former workers, all employed by third-party outsourcing companies, have provided content moderation services for AI tools used by Meta, ByteDance, and OpenAI -- the respective owners of Facebook, TikTok and the breakout AI chatbot ChatGPT. Despite the mental toll of the work, which has left many content moderators suffering from PTSD, their jobs are some of the lowest-paid in the global tech industry, with some workers earning as little as $1.50 per hour. As news of the successful vote to register the union was read out, the packed room of workers at the Mövenpick Hotel in Nairobi burst into cheers and applause, a video from the event seen by TIME shows. Confetti fell onto the stage, and jubilant music began to play as the crowd continued to cheer.


ChatGPT and the sweatshops powering the digital age

Al Jazeera

On January 18, Time magazine published revelations that alarmed, if not necessarily surprised, many who work in artificial intelligence. The news concerned ChatGPT, an advanced AI chatbot that is both hailed as one of the most intelligent AI systems built to date and feared as a new frontier in potential plagiarism and the erosion of craft in writing. Many had wondered how ChatGPT, which stands for Chat Generative Pre-trained Transformer, had improved upon earlier versions of this technology that would quickly descend into hate speech. The answer came in the Time magazine piece: dozens of Kenyan workers were paid less than $2 per hour to process an endless stream of violent and hateful content in order to make a system primarily marketed to Western users safer. It should be clear to anyone paying attention that our current paradigm of digitalisation has a labour problem. We have pivoted, and are still pivoting, away from the ideal of an open internet built around communities of shared interests to one dominated by the commercial prerogatives of a handful of companies located in specific geographies.


Users trust AI as much as humans for flagging problematic content

#artificialintelligence

Social media users may trust artificial intelligence (AI) as much as human editors to flag hate speech and harmful content, according to researchers at Penn State. The researchers said that when users think about positive attributes of machines, like their accuracy and objectivity, they show more faith in AI. However, if users are reminded about the inability of machines to make subjective decisions, their trust is lower. The findings may help developers design better AI-powered content curation systems that can handle the large amounts of information currently being generated while avoiding the perception that the material has been censored, or inaccurately classified, said S. Shyam Sundar, James P. Jimirro Professor of Media Effects in the Donald P. Bellisario College of Communications and co-director of the Media Effects Research Laboratory. "There's this dire need for content moderation on social media and more generally, online media," said Sundar, who is also an affiliate of Penn State's Institute for Computational and Data Sciences.