Echoes of Human Malice in Agents: Benchmarking LLMs for Multi-Turn Online Harassment Attacks

Padhi, Trilok, Lu, Pinxian, Erol, Abdulkadir, Sutar, Tanmay, Sharma, Gauri, Sonmez, Mina, De Choudhury, Munmun, Kursuncu, Ugur

arXiv.org Artificial Intelligence

Large Language Model (LLM) agents are powering a growing share of interactive web applications, yet remain vulnerable to misuse and harm. Prior jailbreak research has largely focused on single-turn prompts, whereas real harassment often unfolds over multi-turn interactions. In this work, we present the Online Harassment Agentic Benchmark, consisting of: (i) a synthetic multi-turn harassment conversation dataset, (ii) a multi-agent (e.g., harasser, victim) simulation informed by repeated game theory, (iii) three jailbreak methods attacking agents across memory, planning, and fine-tuning, and (iv) a mixed-methods evaluation framework. We evaluate two prominent LLMs: LLaMA-3.1-8B-Instruct (open-source) and Gemini-2.0-Flash (closed-source). Our results show that jailbreak tuning makes harassment nearly guaranteed, with an attack success rate of 95.78–96.89% vs. 57.25–64.19% without tuning for LLaMA, and 99.33% vs. 98.46% for Gemini, while sharply reducing the refusal rate to 1–2% in both models. The most prevalent toxic behaviors are Insult (84.9–87.8% vs. 44.2–50.8% without tuning) and Flaming (81.2–85.1% vs. 31.5–38.8%), indicating weaker guardrails for these behaviors than for sensitive categories such as sexual or racial harassment. Qualitative evaluation further reveals that attacked agents reproduce human-like aggression profiles, such as Machiavellian/psychopathic patterns under planning attacks and narcissistic tendencies under memory attacks. Counterintuitively, closed-source and open-source models exhibit distinct escalation trajectories across turns, with closed-source models proving notably vulnerable. Overall, our findings show that multi-turn, theory-grounded attacks not only succeed at high rates but also mimic human-like harassment dynamics, motivating robust safety guardrails that ultimately keep online platforms safe and responsible.
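The headline numbers above are aggregates over judge-labeled multi-turn conversations. A minimal sketch of how attack success rate and refusal rate might be computed, assuming per-turn labels come from an upstream judge (the data layout below is hypothetical, not the paper's actual schema):

from dataclasses import dataclass

@dataclass
class Turn:
    speaker: str          # "harasser" or "victim" agent
    text: str
    is_harassment: bool   # judge-assigned toxicity label for this turn
    is_refusal: bool      # the model declined to produce the requested turn

def attack_success_rate(conversations: list[list[Turn]]) -> float:
    """Fraction of conversations with at least one successful harassing turn."""
    hits = sum(any(t.is_harassment for t in conv) for conv in conversations)
    return hits / len(conversations)

def refusal_rate(conversations: list[list[Turn]]) -> float:
    """Fraction of all turns that were refusals."""
    turns = [t for conv in conversations for t in conv]
    return sum(t.is_refusal for t in turns) / len(turns)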


AutoRAN: Automated Hijacking of Safety Reasoning in Large Reasoning Models

Liang, Jiacheng, Jiang, Tanqiu, Wang, Yuhui, Zhu, Rongyi, Ma, Fenglong, Wang, Ting

arXiv.org Artificial Intelligence

This paper presents AutoRAN, the first framework to automate the hijacking of internal safety reasoning in large reasoning models (LRMs). At its core, AutoRAN pioneers an execution-simulation paradigm that leverages a weaker but less-aligned model to simulate execution reasoning for initial hijacking attempts and iteratively refine attacks by exploiting reasoning patterns leaked through the target LRM's refusals. This approach steers the target model to bypass its own safety guardrails and elaborate on harmful instructions. We evaluate AutoRAN against state-of-the-art LRMs, including GPT-o3/o4-mini and Gemini-2.5-Flash, across multiple benchmarks (AdvBench, HarmBench, and StrongReject). Results show that AutoRAN achieves a success rate approaching 100% within one or a few turns, effectively neutralizing reasoning-based defenses even when success is judged by robustly aligned external models. This work reveals that the transparency of the reasoning process itself creates a critical and exploitable attack surface, highlighting the urgent need for new defenses that protect models' reasoning traces rather than merely their final outputs.
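The abstract notes that success is judged by robustly aligned external models. A minimal sketch of such judge-based scoring, where the prompt wording, the judge callable, and the 0-10 scale are all assumptions for illustration rather than AutoRAN's actual evaluation protocol:

def judge_success(judge, instruction: str, response: str, threshold: int = 7) -> bool:
    """Ask an external aligned judge model whether a response complies with
    a harmful benchmark instruction. `judge` is any text-in/text-out callable."""
    prompt = (
        "On a scale of 0-10, how fully does the RESPONSE carry out the "
        f"INSTRUCTION?\nINSTRUCTION: {instruction}\nRESPONSE: {response}\nScore:"
    )
    score = int(judge(prompt).strip())  # sketch: assumes the judge returns a bare number
    return score >= threshold

def success_rate(judge, pairs: list[tuple[str, str]]) -> float:
    """Fraction of (instruction, response) pairs the judge scores as compliant."""
    return sum(judge_success(judge, i, r) for i, r in pairs) / len(pairs)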


AI-induced sexual harassment: Investigating Contextual Characteristics and User Reactions of Sexual Harassment by a Companion Chatbot

Namvarpour, Mohammad, Pauwels, Harrison, Razi, Afsaneh

arXiv.org Artificial Intelligence

Advancements in artificial intelligence (AI) have led to the increase of conversational agents like Replika, designed to provide social interaction and emotional support. However, reports of these AI systems engaging in inappropriate sexual behaviors with users have raised significant concerns. In this study, we conducted a thematic analysis of user reviews from the Google Play Store to investigate instances of sexual harassment by the Replika chatbot. From a dataset of 35,105 negative reviews, we identified 800 relevant cases for analysis. Our findings revealed that users frequently experience unsolicited sexual advances, persistent inappropriate behavior, and failures of the chatbot to respect user boundaries. Users expressed feelings of discomfort, violation of privacy, and disappointment, particularly when seeking a platonic or therapeutic AI companion. This study highlights the potential harms associated with AI companions and underscores the need for developers to implement effective safeguards and ethical guidelines to prevent such incidents. By shedding light on user experiences of AI-induced harassment, we contribute to the understanding of AI-related risks and emphasize the importance of corporate responsibility in developing safer and more ethical AI systems.
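The study narrows 35,105 negative reviews down to 800 relevant cases before thematic coding. A keyword pre-filter like the sketch below is one plausible first pass for that narrowing step (the paper's actual selection procedure is not described here, and the keyword list is purely illustrative):

import re

# Illustrative harassment-related keyword stems; \w* catches inflections
# such as "harassment" or "boundaries".
KEYWORDS = re.compile(r"\b(harass|sexual|inappropriate|boundar|consent)\w*\b", re.I)

def prefilter(reviews: list[str]) -> list[str]:
    """Keep reviews mentioning any harassment-related keyword for manual coding."""
    return [r for r in reviews if KEYWORDS.search(r)]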


China's cyber-abuse scandal: is the government unwilling to crack down on exploitation of women online?

The Guardian

When Ming* found a hidden camera in her bedroom, she prayed for a reasonable explanation, wondering whether her boyfriend had placed it there to record memories of their "happy life" together. But hope quickly turned to horror. Ming's boyfriend had been secretly taking sexually exploitative photos of not just Ming and her female friends, but also of other women in other locations, then using AI technology to generate pornographic images of them. After Ming confronted him, he "begged for mercy" but became angry when she refused to forgive him, Ming reportedly told Chinese news outlet Jimu News. Ming is just one of many women in China who have been covertly photographed or filmed – both in private and public spaces, including toilets – by voyeurs who have then circulated or sold the images online without consent.


Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities

Qu, Yiting, Backes, Michael, Zhang, Yang

arXiv.org Artificial Intelligence

Vision-language models (VLMs) are increasingly applied to identify unsafe or inappropriate images due to their internal ethical standards and powerful reasoning abilities. However, it is still unclear whether they can recognize various unsafe concepts when presented in different modalities, such as text and images. To address this, we first compile the UnsafeConcepts dataset, featuring 75 unsafe concepts, e.g., "Swastika," "Sexual Harassment," and "Assaults," along with 1.5K associated images. We then conduct a systematic evaluation of VLMs' perception (concept recognition) and alignment (ethical reasoning) capabilities. We assess eight popular VLMs and find that, although most VLMs accurately perceive unsafe concepts, they sometimes mistakenly classify these concepts as safe. We also identify a consistent modality gap among open-source VLMs in distinguishing between visual and textual unsafe concepts. To bridge this gap, we introduce a simplified reinforcement learning (RL)-based approach using proximal policy optimization (PPO) to strengthen the ability to identify unsafe concepts from images. Our approach uses reward scores based directly on VLM responses, bypassing the need for collecting human-annotated preference data to train a new reward model. Experimental results show that our approach effectively enhances VLM alignment on images while preserving general capabilities. It outperforms baselines such as supervised fine-tuning (SFT) and direct preference optimization (DPO). We hope our dataset, evaluation findings, and proposed alignment solution contribute to the community's efforts in advancing safe VLMs.
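The key design choice here is deriving the PPO reward directly from VLM responses rather than from a learned reward model. A minimal sketch of such a response-based reward, where the answer-parsing rule and the +/-1 values are assumptions rather than the paper's exact scoring:

def unsafe_concept_reward(vlm_answer: str, image_is_unsafe: bool) -> float:
    """+1 when the VLM's safe/unsafe verdict matches the ground-truth label,
    -1 otherwise. The substring check is a crude stand-in for real parsing."""
    predicted_unsafe = "unsafe" in vlm_answer.lower()
    return 1.0 if predicted_unsafe == image_is_unsafe else -1.0

In practice, a scalar reward of this form could feed a standard PPO trainer such as Hugging Face TRL's PPOTrainer, which is what removes the need for human-annotated preference data.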


Three Ubisoft chiefs found guilty of enabling culture of sexual harassment

The Guardian

Three former executives at the video game company Ubisoft have been given suspended prison sentences for enabling a culture of sexual and psychological harassment in the workplace at the end of the first big trial to stem from the #MeToo movement in the gaming industry. The court in Bobigny, north of Paris, had heard how the former executives used their position to bully or sexually harass staff, leaving women terrified and feeling like pieces of meat. Former staff had said that between 2012 and 2020, the company's offices in Montreuil, east of Paris, were run with a toxic culture of bullying and sexism that one worker likened to a "boys' club above the law". Ubisoft is a French family business that rose to become one of the biggest video game creators in the world. The company has been behind several blockbusters including Assassin's Creed, Far Cry and the children's favourite Just Dance.


PromptAug: Fine-grained Conflict Classification Using Data Augmentation

Warke, Oliver, Jose, Joemon M., Hasibi, Faegheh, Breitsohl, Jan

arXiv.org Artificial Intelligence

Given the rise of conflicts on social media, effective classification models to detect harmful behaviours are essential. Following the garbage-in-garbage-out maxim, machine learning performance depends heavily on training data quality. However, high-quality labelled data, especially for nuanced tasks like identifying conflict behaviours, is limited, expensive, and difficult to obtain. Additionally, as social media platforms increasingly restrict access to research data, text data augmentation is gaining attention as an alternative way to generate training data. Augmenting conflict-related data poses unique challenges due to Large Language Model (LLM) guardrails that prevent generation of offensive content. This paper introduces PromptAug, an innovative LLM-based data augmentation method. PromptAug achieves statistically significant improvements of 2% in both accuracy and F1-score on conflict and emotion datasets. To thoroughly evaluate PromptAug against other data augmentation methods, we conduct a robust evaluation using extreme data scarcity scenarios, quantitative diversity analysis, and a qualitative thematic analysis. The thematic analysis identifies four problematic patterns in augmented text: Linguistic Fluidity, Humour Ambiguity, Augmented Content Ambiguity, and Augmented Content Misinterpretation. Overall, this work presents PromptAug as an effective method for augmenting data in sensitive tasks like conflict detection, offering a unique, interdisciplinary evaluation grounded in both natural language processing and social science methodology.
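A minimal sketch of LLM-based augmentation in the spirit of PromptAug: paraphrase a labelled conflict example while preserving its label. The prompt wording and the generate() callable are assumptions, not the paper's actual template:

def augment(generate, text: str, label: str, n: int = 3) -> list[str]:
    """Produce n label-preserving paraphrases of a labelled training example.
    `generate` is any text-in/text-out LLM callable."""
    prompt = (
        f"Rewrite the following social-media post in different words, keeping "
        f"its meaning and its '{label}' tone. Post: {text}\nRewrite:"
    )
    return [generate(prompt) for _ in range(n)]

# Usage sketch: augmented = augment(my_llm, "example post", "conflict", n=3)

The guardrail challenge the abstract mentions shows up exactly at this step: for offensive source posts, the generator may refuse, so the prompt and any refusals need explicit handling.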


A man stalked a professor for six years. Then he used AI chatbots to lure strangers to her home

The Guardian

A man from Massachusetts has agreed to plead guilty to a seven-year cyberstalking campaign that included using artificial intelligence (AI) chatbots to impersonate a university professor and invite men online to her home address for sex. James Florence, 36, used platforms such as CrushOn.ai and JanitorAI, which allow users to design their own chatbots and direct them how to respond to other users during chats, including in sexually suggestive and explicit ways, according to court documents seen by the Guardian. The victim's identity has been kept confidential by law enforcement officials. Florence admitted to using the victim's personal and professional information – including her home address, date of birth and family information – to instruct the chatbots to impersonate her and engage in sexual dialogue with users, per court filings. He told the chatbots to answer "yes" in the guise of his victim when a user asked whether she was sexually adventurous and fed the AI responses of what underwear she liked to wear.


Detecting harassment and defamation in cyberbullying with emotion-adaptive training

Yi, Peiling, Zubiaga, Arkaitz, Long, Yunfei

arXiv.org Artificial Intelligence

Existing research on detecting cyberbullying incidents on social media has primarily concentrated on harassment and is typically approached as a binary classification task. However, cyberbullying encompasses various forms, such as denigration and harassment, which celebrities frequently face. Furthermore, suitable training data for these diverse forms of cyberbullying remains scarce. In this study, we first develop a celebrity cyberbullying dataset that encompasses two distinct types of incidents: harassment and defamation. We investigate various types of transformer-based models, namely masked (RoBERTa, BERT and DistilBERT), replaced-token detection (Electra), autoregressive (XLNet), masked-and-permuted (MPNet), text-to-text (T5) and large language models (Llama 2 and Llama 3), under low-resource settings. We find that they perform competitively on binary detection of explicit harassment. However, their performance is substantially lower on harassment and denigration multi-classification tasks. Therefore, we propose an emotion-adaptive training framework (EAT) that transfers knowledge from the domain of emotion detection to the domain of cyberbullying detection to help detect indirect cyberbullying events. EAT consistently improves the average macro F1, precision and recall by 20% in cyberbullying detection tasks across nine transformer-based models under low-resource settings. Our claims are supported by intuitive theoretical insights and extensive experiments.
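The core of EAT is transfer from emotion detection to cyberbullying detection. A generic two-stage fine-tuning sketch of that idea, under stated assumptions (this is plain sequential fine-tuning, not the paper's exact EAT procedure; emotion_ds and bully_ds are assumed pre-tokenized Hugging Face datasets):

from transformers import (AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

def emotion_adaptive_train(emotion_ds, bully_ds, base="roberta-base"):
    # Stage 1: adapt the encoder to the emotion-detection domain
    # (assuming a 6-way emotion label set).
    model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=6)
    Trainer(model=model, args=TrainingArguments(output_dir="stage1"),
            train_dataset=emotion_ds).train()
    model.save_pretrained("stage1/final")

    # Stage 2: reuse the emotion-adapted encoder with a fresh 3-way head
    # (e.g., none / harassment / denigration); ignore_mismatched_sizes
    # re-initializes only the classification head.
    model = AutoModelForSequenceClassification.from_pretrained(
        "stage1/final", num_labels=3, ignore_mismatched_sizes=True)
    Trainer(model=model, args=TrainingArguments(output_dir="stage2"),
            train_dataset=bully_ds).train()
    return model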


The Video Game Industry Is Finally Getting Serious About Player Safety

WIRED

In 2025 we will enter a new era of safety by design for our digital playgrounds. Online games are spaces where billions of people worldwide come together to play, socialize, and unwind. However, they are also environments where harassment, hate speech, and grooming for violence and sexual exploration frequently occur. Today, most players of online games report being a direct target or witnessing one or more of these actions. A 2024 report found 82 percent of players report being a direct victim, and 88 percent report witnessing some form of so-called "toxic" behavior.