From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring

Jun-12-2026, 03:11:16 GMT–Neural Information Processing Systems

Although safety alignment has been applied to most large language models (LLMs), LLM service providers generally deploy a subsequent moderation step as an external safety guardrail in real-world products. Existing moderators mainly perform conventional full detection, which determines harmfulness from the complete LLM output and therefore causes high service latency. Recent work pays more attention to partial detection, in which moderators monitor the generation midway and stop the output early if harmfulness is detected. However, these approaches directly apply moderators trained under the full detection paradigm to incomplete outputs, introducing a training-inference gap that lowers performance. In this paper, we explore how to form a data-and-model solution that natively supports partial detection.
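The contrast between full and partial detection can be illustrated with a minimal sketch. This is not the paper's method: `toy_score` is a hypothetical stand-in for a moderator, and the function names, the threshold value, and the token stream are all assumptions made for illustration only.

```python
from typing import Callable, Iterable, List, Tuple

# A scorer maps the tokens seen so far to a harmfulness score in [0, 1].
ScoreFn = Callable[[List[str]], float]


def full_detection(
    tokens: Iterable[str], score_fn: ScoreFn, threshold: float
) -> Tuple[List[str], bool]:
    """Wait for the complete output, then score it once."""
    output = list(tokens)  # the whole response must be generated first
    return output, score_fn(output) >= threshold


def partial_detection(
    tokens: Iterable[str], score_fn: ScoreFn, threshold: float
) -> Tuple[List[str], bool]:
    """Score the growing prefix after every token and stop early if flagged."""
    emitted: List[str] = []
    for tok in tokens:
        emitted.append(tok)
        if score_fn(emitted) >= threshold:
            return emitted, True  # stop generation at the first flagged prefix
    return emitted, False


def toy_score(prefix: List[str]) -> float:
    """Hypothetical scorer: flags any prefix containing the marker token."""
    return 1.0 if "HARM" in prefix else 0.0


stream = ["Sure", ",", "here", "is", "HARM", "more", "text"]
full_out, full_flag = full_detection(stream, toy_score, 0.5)
part_out, part_flag = partial_detection(stream, toy_score, 0.5)
print(len(full_out), full_flag)
print(len(part_out), part_flag)
```

In this toy setting both approaches flag the stream, but the partial variant stops at the first flagged prefix instead of consuming the full output. The sketch reuses one scorer for both modes, which is exactly the setup the abstract criticizes when that scorer was trained only on complete outputs.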

artificial intelligence, large language model, natural language

Technology:
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)