Kelp: A Streaming Safeguard for Large Models via Latent Dynamics-Guided Risk Detection
Li, Xiaodan, Wu, Mengjie, Zhu, Yao, Lv, Yunna, Chen, YueFeng, Chen, Cen, Guo, Jianmei, Xue, Hui
–arXiv.org Artificial Intelligence
Large models (LMs) are powerful content generators, yet their open-ended nature can also introduce risks, such as generating harmful or biased content. Existing guardrails mostly perform post-hoc detection, which may expose unsafe content before it is caught, and latency constraints further push them toward lightweight models, limiting detection accuracy. In this work, we propose Kelp, a novel plug-in framework that enables streaming risk detection within the LM generation pipeline. Kelp leverages intermediate LM hidden states through a Streaming Latent Dynamics Head (SLD), which models the temporal evolution of risk across the generated sequence for more accurate real-time risk detection. To ensure reliable streaming moderation in real applications, we introduce an Anchored Temporal Consistency (ATC) loss that enforces monotonic harm predictions by embedding a benign-then-harmful temporal prior. In addition, for rigorous evaluation of streaming guardrails, we present StreamGuardBench, a model-grounded benchmark featuring on-the-fly responses from each protected model, reflecting real-world streaming scenarios in both text and vision-language tasks. Across diverse models and datasets, Kelp consistently outperforms state-of-the-art post-hoc guardrails and prior plug-in probes (15.61% higher average F1), while using only 20M parameters and adding less than 0.5 ms of per-token latency.
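The abstract does not give the form of the ATC loss, so the following is only an illustrative sketch of what a "monotonic, benign-then-harmful" constraint on per-token risk scores could look like. All function names, the squared-error anchors, and the 0.5 threshold are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Optional


def monotonicity_penalty(scores: List[float]) -> float:
    """Sum of every drop between consecutive per-token risk scores.

    Zero when the scores never decrease, which is the behaviour a
    monotonic harm prediction would show. (Hypothetical formulation.)
    """
    return sum(max(0.0, prev - cur) for prev, cur in zip(scores, scores[1:]))


def anchor_penalty(scores: List[float], is_harmful: bool) -> float:
    """Squared-error anchors at both ends of the sequence.

    Assumption: a response starts benign (score near 0) and, if it is
    labelled harmful, ends with a score near 1; otherwise it stays near 0.
    """
    end_target = 1.0 if is_harmful else 0.0
    return scores[0] ** 2 + (scores[-1] - end_target) ** 2


def first_flagged_token(scores: List[float], threshold: float = 0.5) -> Optional[int]:
    """Index of the first token whose risk score reaches the threshold.

    In a streaming setting, generation could be halted at this index.
    Returns None if no token is flagged.
    """
    for index, score in enumerate(scores):
        if score >= threshold:
            return index
    return None


if __name__ == "__main__":
    rising = [0.1, 0.2, 0.6, 0.9]   # consistent benign-then-harmful trajectory
    noisy = [0.1, 0.7, 0.3, 0.9]    # score drops mid-sequence
    print(monotonicity_penalty(rising), monotonicity_penalty(noisy))
    print(first_flagged_token(rising))
```

Under these assumptions, a trajectory that only rises incurs no monotonicity penalty, while one that dips after flagging a token is penalised, which discourages a streaming detector from retracting an earlier alarm.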
Oct-14-2025
- Genre:
- Research Report (0.64)
- Industry:
- Government (0.68)
- Information Technology > Security & Privacy (0.46)
- Law (1.00)
- Technology: