moderator
From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring
Warning: this paper may contain potentially generated harmful content. Though safety alignment has been applied to most large language models (LLMs), LLM service providers generally deploy a subsequent moderation as the external safety guardrail in real-world products. Existing moderators mainly practice a conventional full detection, which determines the harmfulness based on the complete LLM output, causing high service latency. Recent works pay more attention to partial detection where moderators oversee the generation midway and early stop the output if harmfulness is detected, but they directly apply moderators trained with the full detection paradigm to incomplete outputs, introducing a training-inference gap that lowers the performance. In this paper, we explore how to form a data-andmodel solution that natively supports partial detection. For the data, we construct FineHarm, a dataset consisting of 29K prompt-response pairs with fine-grained token-level annotations to provide reasonable supervision for token-level training. Then, we propose the Streaming Content Monitor (SCM), which is trained with dual supervision of response-and token-level labels and can follow the output stream of LLM to make a timely judgment of harmfulness. Experiments show that SCM gains 0.95+ in macro F1 score that is comparable to full detection, by only seeing the first 18% of tokens in responses on average. Moreover, the SCM can serve as a pseudo-harmfulness annotator for improving safety alignment and lead to a higher harmlessness score than DPO.
Reddit's human content wins amid the AI flood
Reddit's human content wins amid the AI flood For Ines Tan there's one particular site she turns to again and again for advice - and that's Reddit. Tan, who works in communications, regularly jumps on the site for skincare advice, to view reactions to shows she watches, such as The Traitors, and for help planning her upcoming wedding in May. It's a very empathetic place, she says of Reddit. For my wedding, I've found help emotionally, logistically and inspiration-wise. Tan believes people are consulting the online discussion platform more as they're craving human interaction in the world of increasing AI slop.
'In the end, you feel blank': India's female workers watching hours of abusive content to train AI
A still from Humans in the Loop, a 2024 documentary that follows female data workers in Jharkhand state, India, whose labour underpins global AI systems. A still from Humans in the Loop, a 2024 documentary that follows female data workers in Jharkhand state, India, whose labour underpins global AI systems. 'In the end, you feel blank': India's female workers watching hours of abusive content to train AI Thu 5 Feb 2026 03.00 ESTLast modified on Thu 5 Feb 2026 03.03 EST On the veranda of her family's home, with her laptop balanced on a mud slab built into the wall, Monsumi Murmu works from one of the few places where the mobile signal holds. The familiar sounds of domestic life come from inside the house: clinking utensils, footsteps, voices. On her screen a very different scene plays: a woman is pinned down by a group of men, the camera shakes, there is shouting and the sound of breathing.
AI Slop Is Ruining Reddit for Everyone
Reddit is considered one of the most human spaces left on the internet, but mods and users are overwhelmed with slop posts in the most popular subreddits. A Reddit post about a bride who demands a wedding guest wear a specific, unflattering shade is sure to provoke rage, let alone one about a bridesmaid or mother of the groom who wants to wear white. A scenario where a parent asks someone on an airplane to switch seats so they can sit next to their young child is likely to invoke the same rush of anger. But those posts may trigger a Reddit moderator's annoyance for a different reason--they are common themes within a growing genre of AI -generated, fake posts. These are examples that spring to mind for Cassie, one of dozens of moderators for r/AmItheAsshole .
Ask WhAI:Probing Belief Formation in Role-Primed LLM Agents
Moore, Keith, Kim, Jun W., Lyu, David, Heo, Jeffrey, Adeli, Ehsan
We present Ask WhAI, a systems-level framework for inspecting and perturbing belief states in multi-agent interactions. The framework records and replays agent interactions, supports out-of-band queries into each agent's beliefs and rationale, and enables counterfactual evidence injection to test how belief structures respond to new information. We apply the framework to a medical case simulator notable for its multi-agent shared memory (a time-stamped electronic medical record, or EMR) and an oracle agent (the LabAgent) that holds ground truth lab results revealed only when explicitly queried. We stress-test the system on a multi-specialty diagnostic journey for a child with an abrupt-onset neuropsychiatric presentation. Large language model agents, each primed with strong role-specific priors ("act like a neurologist", "act like an infectious disease specialist"), write to a shared medical record and interact with a moderator across sequential or parallel encounters. Breakpoints at key diagnostic moments enable pre- and post-event belief queries, allowing us to distinguish entrenched priors from reasoning or evidence-integration effects. The simulation reveals that agent beliefs often mirror real-world disciplinary stances, including overreliance on canonical studies and resistance to counterevidence, and that these beliefs can be traced and interrogated in ways not possible with human experts. By making such dynamics visible and testable, Ask WhAI offers a reproducible way to study belief formation and epistemic silos in multi-agent scientific reasoning.
SlideBot: A Multi-Agent Framework for Generating Informative, Reliable, Multi-Modal Presentations
Xie, Eric, Waterfield, Danielle, Kennedy, Michael, Zhang, Aidong
Large Language Models (LLMs) have shown immense potential in education, automating tasks like quiz generation and content summarization. However, generating effective presentation slides introduces unique challenges due to the complexity of multimodal content creation and the need for precise, domain-specific information. Existing LLM-based solutions often fail to produce reliable and informative outputs, limiting their educational value. To address these limitations, we introduce SlideBot - a modular, multi-agent slide generation framework that integrates LLMs with retrieval, structured planning, and code generation. SlideBot is organized around three pillars: informativeness, ensuring deep and contextually grounded content; reliability, achieved by incorporating external sources through retrieval; and practicality, which enables customization and iterative feedback through instructor collaboration. It incorporates evidence-based instructional design principles from Cognitive Load Theory (CLT) and the Cognitive Theory of Multimedia Learning (CTML), using structured planning to manage intrinsic load and consistent visual macros to reduce extraneous load and enhance dual-channel learning. Within the system, specialized agents collaboratively retrieve information, summarize content, generate figures, and format slides using LaTeX, aligning outputs with instructor preferences through interactive refinement. Evaluations from domain experts and students in AI and biomedical education show that SlideBot consistently enhances conceptual accuracy, clarity, and instructional value. These findings demonstrate SlideBot's potential to streamline slide preparation while ensuring accuracy, relevance, and adaptability in higher education.
Question the Questions: Auditing Representation in Online Deliberative Processes
De, Soham, Gelauff, Lodewijk, Goel, Ashish, Milli, Smitha, Procaccia, Ariel, Siu, Alice
A central feature of many deliberative processes, such as citizens' assemblies and deliberative polls, is the opportunity for participants to engage directly with experts. While participants are typically invited to propose questions for expert panels, only a limited number can be selected due to time constraints. This raises the challenge of how to choose a small set of questions that best represent the interests of all participants. We introduce an auditing framework for measuring the level of representation provided by a slate of questions, based on the social choice concept known as justified representation (JR). We present the first algorithms for auditing JR in the general utility setting, with our most efficient algorithm achieving a runtime of $O(mn\log n)$, where $n$ is the number of participants and $m$ is the number of proposed questions. We apply our auditing methods to historical deliberations, comparing the representativeness of (a) the actual questions posed to the expert panel (chosen by a moderator), (b) participants' questions chosen via integer linear programming, (c) summary questions generated by large language models (LLMs). Our results highlight both the promise and current limitations of LLMs in supporting deliberative processes. By integrating our methods into an online deliberation platform that has been used for over hundreds of deliberations across more than 50 countries, we make it easy for practitioners to audit and improve representation in future deliberations.
Collab-REC: An LLM-based Agentic Framework for Balancing Recommendations in Tourism
Banerjee, Ashmi, Satish, Adithi, Aisyah, Fitri Nur, Wรถrndl, Wolfgang, Deldjoo, Yashar
We propose Collab-REC, a multi-agent framework designed to counteract popularity bias and enhance diversity in tourism recommendations. In our setting, three LLM-based agents -- Personalization, Popularity, and Sustainability generate city suggestions from complementary perspectives. A non-LLM moderator then merges and refines these proposals via multi-round negotiation, ensuring each agent's viewpoint is incorporated while penalizing spurious or repeated responses. Experiments on European city queries show that Collab-REC improves diversity and overall relevance compared to a single-agent baseline, surfacing lesser-visited locales that often remain overlooked. This balanced, context-aware approach addresses over-tourism and better aligns with constraints provided by the user, highlighting the promise of multi-stakeholder collaboration in LLM-driven recommender systems.
MoMoE: Mixture of Moderation Experts Framework for AI-Assisted Online Governance
Goyal, Agam, Zhan, Xianyang, Chen, Yilun, Saha, Koustuv, Chandrasekharan, Eshwar
Large language models (LLMs) have shown great potential in flagging harmful content in online communities. Yet, existing approaches for moderation require a separate model for every community and are opaque in their decision-making, limiting real-world adoption. We introduce Mixture of Moderation Experts (MoMoE), a modular, cross-community framework that adds post-hoc explanations to scalable content moderation. MoMoE orchestrates four operators -- Allocate, Predict, Aggregate, Explain -- and is instantiated as seven community-specialized experts (MoMoE-Community) and five norm-violation experts (MoMoE-NormVio). On 30 unseen subreddits, the best variants obtain Micro-F1 scores of 0.72 and 0.67, respectively, matching or surpassing strong fine-tuned baselines while consistently producing concise and reliable explanations. Although community-specialized experts deliver the highest peak accuracy, norm-violation experts provide steadier performance across domains. These findings show that MoMoE yields scalable, transparent moderation without needing per-community fine-tuning. More broadly, they suggest that lightweight, explainable expert ensembles can guide future NLP and HCI research on trustworthy human-AI governance of online communities.