Safety Depth in Large Language Models: A Markov Chain Perspective
Neural Information Processing Systems
Large Language Models (LLMs) are increasingly adopted in high-stakes scenarios, yet their safety mechanisms often remain fragile. Simple jailbreak prompts or even benign fine-tuning can bypass internal safeguards, underscoring the need to understand the failure modes of current safety strategies. Recent findings suggest that vulnerabilities emerge when alignment is confined to only the initial output tokens. To address this, we introduce the notion of safety depth, a designated output position where the model refuses to generate harmful content. While deeper alignment appears promising, identifying the optimal safety depth remains an open and underexplored challenge.
Jun-21-2026, 13:17:24 GMT
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (0.67)
- Industry:
- Information Technology (0.46)
- Technology: