Toxicity Detection for Free
Neural Information Processing Systems
In this paper, we introduce Moderation Using LLM Introspection (MULI), which detects toxic prompts using information extracted directly from the LLM itself. We find that benign and toxic prompts can be distinguished by the distribution of the first response token's logits.
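As a rough illustration of the idea of reading a toxicity signal off the first response token's logits, the sketch below sums the softmax probability assigned to a set of refusal-style opening tokens and compares it with a threshold. This is a toy heuristic under stated assumptions, not the MULI detector itself: the five-token vocabulary, the `REFUSAL_IDS` set, the function names, and the 0.5 threshold are all hypothetical and chosen only for the example.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def refusal_score(first_token_logits, refusal_token_ids):
    """Probability mass the model puts on refusal-style first tokens."""
    probs = softmax(first_token_logits)
    return sum(probs[i] for i in refusal_token_ids)


def is_toxic(first_token_logits, refusal_token_ids, threshold=0.5):
    """Flag a prompt when the refusal mass reaches the (assumed) threshold."""
    return refusal_score(first_token_logits, refusal_token_ids) >= threshold


# Hypothetical 5-token vocabulary: ids 0 and 1 stand in for refusal
# openers (e.g. "Sorry", "I"); the logit values are made up.
REFUSAL_IDS = [0, 1]
benign_logits = [0.1, 0.2, 3.0, 2.5, 1.0]
toxic_logits = [4.0, 3.5, 0.2, 0.1, 0.0]

benign_flag = is_toxic(benign_logits, REFUSAL_IDS)
toxic_flag = is_toxic(toxic_logits, REFUSAL_IDS)
```

In practice the logits would come from a single forward pass of the LLM on the prompt, so the check adds little cost beyond what generating the response already requires. A learned classifier over the full logit vector would likely separate the two classes better than a fixed token set and threshold.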
Nov-14-2025, 11:25:47 GMT
- Country:
- Asia
- Europe > United Kingdom
- England (0.04)
- North America > United States
- California > Alameda County > Berkeley (0.04)
- Genre:
- Research Report > Experimental Study (1.00)
- Industry:
- Education (0.93)
- Information Technology > Security & Privacy (0.68)
- Technology: