C-SafeGen: Certified Safe LLM Generation with Claim-Based Streaming Guardrails
Neural Information Processing Systems
Despite the remarkable capabilities of large language models (LLMs) across diverse applications, they remain vulnerable to generating content that violates safety regulations and policies. To mitigate these risks, LLMs undergo safety alignment; however, they can still be effectively jailbroken. Off-the-shelf guardrail models are commonly deployed to monitor generations, but these models primarily focus on detection rather than ensuring safe decoding of LLM outputs. Moreover, existing efforts lack rigorous safety guarantees, which are crucial for the universal deployment of LLMs and certifiable compliance with regulatory standards. In this paper, we propose a Claim-based Stream Decoding (CSD) algorithm coupled with a statistical risk guarantee framework using conformal analysis.
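The abstract names a statistical risk guarantee based on conformal analysis but does not spell out the procedure. As a rough illustration only, the following is a minimal sketch of generic split conformal calibration, not the paper's CSD algorithm or its actual guarantee. It assumes a hypothetical guardrail that assigns each generation an unsafety score in [0, 1] and that calibration and test generations are exchangeable; the function name `conformal_upper_bound` and the example scores are invented for this sketch.

```python
import math


def conformal_upper_bound(calibration_scores, alpha):
    """Split conformal upper bound on a new generation's unsafety score.

    Under exchangeability, a fresh score falls at or below the returned
    value with probability at least 1 - alpha. All names and scores here
    are hypothetical; this is not the procedure from the paper.
    """
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration samples to certify this level.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


# Hypothetical guardrail unsafety scores on held-out calibration prompts.
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80, 0.90]
bound_half = conformal_upper_bound(scores, alpha=0.5)
bound_strict = conformal_upper_bound(scores, alpha=0.05)
```

With only nine calibration scores, the stricter level `alpha=0.05` cannot be certified and the sketch returns infinity, which shows why such guarantees need a calibration set large enough for the target risk level.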
Jun-17-2026, 17:45:53 GMT
- Country:
- North America > United States (0.46)
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (0.93)
- Research Report
- Industry:
- Government (0.68)
- Information Technology > Security & Privacy (0.46)
- Technology: