Online Safety Monitoring for LLMs

Mona Schirmer, Metod Jazbec, Alexander Timans, Christian Naesseth, Maja Waldron, Eric Nalisnick

arXiv.org Machine Learning 

We deploy a simple statistical framework based on risk control (Angelopoulos et al., 2022) that converts any safety signal into a binary decision rule, and offers statistical guarantees on the false alarm or missed detection rate. The framework is universally applicable to different monitoring purposes and can leverage arbitrary proxy signals. Through experiments on mathematical problem solving and red teaming conversations, we [...]

[...] into our everyday lives as search engines (Jin et al., 2025; Xiong et al., 2024), coding assistants (Zhao et al., 2023), and companions (Zhang et al., 2025a). As their applicability grows, so does the potential harm caused by malicious LLM outputs. Despite remarkable performance across a wide range of tasks, LLMs remain prone to generating hallucinated, factually incorrect (Ravichander et al., 2025), or harmful output (Yu et al., 2025) when deployed.
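To make the idea of converting a safety signal into a binary decision rule concrete, the following is a minimal sketch and not necessarily the procedure used in this work. It assumes a scalar safety score where higher means more suspicious, a held-out calibration set of scores from outputs known to be safe, and a conformal-style quantile threshold. The function names `calibrate_threshold` and `raise_alarm` are hypothetical. Under exchangeability of calibration and test scores, the chosen threshold keeps the false alarm rate at or below `alpha` on average over calibration draws.

```python
import math


def calibrate_threshold(safe_scores, alpha):
    """Pick a threshold from scores of known-safe outputs (illustrative sketch).

    Assumption: scores are exchangeable with future safe scores, and an
    alarm is raised when score > threshold. The k-th smallest calibration
    score with k = ceil((n + 1) * (1 - alpha)) then bounds the false alarm
    rate by alpha in expectation.
    """
    n = len(safe_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: never raise an alarm.
        return math.inf
    return sorted(safe_scores)[k - 1]


def raise_alarm(score, threshold):
    """Binary decision rule: flag the output if its score exceeds the threshold."""
    return score > threshold


# Hypothetical calibration scores from nine safe outputs.
calibration = [1, 2, 3, 4, 5, 6, 7, 8, 9]
threshold = calibrate_threshold(calibration, alpha=0.2)
flagged = raise_alarm(9, threshold)
```

Controlling the missed detection rate instead would follow the same pattern with the roles reversed: calibrate on scores from known-unsafe outputs and take a lower quantile, so that at most a fraction `alpha` of unsafe outputs fall at or below the threshold.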