Online Safety Monitoring for LLMs

Mona Schirmer, Metod Jazbec, Alexander Timans, Christian Naesseth, Maja Waldron, Eric Nalisnick

Jul-3-2026–arXiv.org Machine Learning

We deploy a simple statistical framework based on risk control (Angelopoulos et al., 2022) that converts any safety signal into a binary decision rule, and offers statistical guarantees on the false alarm or missed detection rate. The framework is universally applicable to different monitoring purposes and can leverage arbitrary proxy signals. Through experiments on mathematical problem solving and red teaming conversations, we [...]

[...] into our everyday lives as search engines (Jin et al., 2025; Xiong et al., 2024), coding assistants (Zhao et al., 2023), and companions (Zhang et al., 2025a). As their applicability grows, so does the potential harm caused by malicious LLM outputs. Despite remarkable performance across a wide range of tasks, LLMs remain prone to generating hallucinated, factually incorrect (Ravichander et al., 2025), or harmful output (Yu et al., 2025) when deployed.
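To make the idea of turning a safety signal into a binary decision rule concrete, here is a minimal sketch. It is not the authors' procedure: it swaps in a plain split-conformal quantile rule as a stand-in for the risk control framework, and it assumes a scalar score where higher means "more likely unsafe", a held-out calibration set of scores from outputs known to be safe, and exchangeability between calibration and deployment data. The names `calibrate_threshold` and `monitor` are hypothetical.

```python
import math


def calibrate_threshold(safe_scores, alpha):
    """Pick a threshold from calibration scores of known-safe outputs.

    Assumption (not from the paper): flagging whenever score > threshold
    then has a marginal false alarm rate of at most alpha, provided the
    calibration and deployment scores are exchangeable.
    """
    n = len(safe_scores)
    # Index of the conformal quantile: the k-th smallest calibration score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: never raise an alarm.
        return math.inf
    return sorted(safe_scores)[k - 1]


def monitor(score, threshold):
    """Binary decision rule: True means 'raise an alarm'."""
    return score > threshold


# Hypothetical calibration scores from a proxy safety signal on safe outputs.
calibration = [3, 1, 7, 5, 2, 6, 4]
threshold = calibrate_threshold(calibration, alpha=0.25)
alarms = [monitor(s, threshold) for s in (2.0, 6.5, 9.0)]
```

Controlling the missed detection rate instead would follow the same pattern, calibrating on scores from known-unsafe outputs and flagging from the other tail.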


