LLM Content Moderation and User Satisfaction: Evidence from Response Refusals in Chatbot Arena

Pasch, Stefan

arXiv.org Artificial Intelligence 

LLM safety and ethical alignment are widely discussed, but the impact of content moderation on user satisfaction remains underexplored. To address this, we analyze nearly 50,000 Chatbot Arena response pairs using a novel fine-tuned RoBERTa model that we trained on hand-labeled data to disentangle refusals due to ethical concerns from refusals due to technical limitations or lack of information. Our findings reveal a significant refusal penalty for content moderation: users select ethics-based refusals as their preferred LLM response roughly one-fourth as often as standard responses. However, context and phrasing play critical roles: refusals of highly sensitive prompts, such as those involving illegal content, achieve higher win rates than refusals over less sensitive ethical concerns, and longer responses closely aligned with the prompt perform better. These results emphasize the need for nuanced moderation strategies that balance ethical safeguards with user satisfaction. Moreover, we find that the refusal penalty is notably lower in evaluations using the LLM-as-a-Judge method, highlighting discrepancies between user and automated assessments.

Trigger Warning and Disclaimer: This paper discusses content moderation in LLMs, including sensitive topics such as hate speech, harassment, and illegal activities, as part of an analysis of LLM performance and user satisfaction. The study does not endorse or promote any harmful, illegal, or unethical content, nor does it make any normative judgments about the "right amount" of content moderation.
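The refusal classifier described above is a standard three-way sequence-classification setup. The following is a minimal sketch, not the authors' exact pipeline: it builds a small, randomly initialized RoBERTa-style classifier (to avoid downloading pretrained weights here; the paper fine-tunes a full pretrained RoBERTa) with one label per response type, and the label scheme and tiny dimensions are illustrative assumptions.

```python
import torch
from transformers import RobertaConfig, RobertaForSequenceClassification

# Hypothetical label scheme for illustration:
# 0 = standard answer, 1 = ethics-based refusal, 2 = technical/other refusal.
NUM_LABELS = 3

# Tiny randomly initialized config so the sketch runs without pretrained
# weights; a real fine-tune would start from "roberta-base" or similar.
config = RobertaConfig(
    vocab_size=1000,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
    num_labels=NUM_LABELS,
)
model = RobertaForSequenceClassification(config)

# Dummy batch standing in for tokenized chatbot responses, with
# hand-labeled refusal categories as targets.
input_ids = torch.randint(0, 1000, (4, 16))
attention_mask = torch.ones_like(input_ids)
labels = torch.tensor([0, 1, 2, 1])

out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
print(tuple(out.logits.shape))  # one score per refusal class: (4, 3)
loss = out.loss  # cross-entropy loss minimized during fine-tuning
```

In practice each Chatbot Arena response would be tokenized with the matching RoBERTa tokenizer, and `out.logits.argmax(-1)` would give the predicted refusal category.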
