Should LLM Safety Be More Than Refusing Harmful Instructions?
Maskey, Utsav, Dras, Mark, Naseem, Usman
–arXiv.org Artificial Intelligence
This paper presents a systematic evaluation of Large Language Models' (LLMs) behavior on long-tail distributed (encrypted) texts and their safety implications. We introduce a two-dimensional framework for assessing LLM safety: (1) instruction refusal-the ability to reject harmful obfuscated instructions, and (2) generation safety-the suppression of generating harmful responses. Through comprehensive experiments, we demonstrate that models that possess capabilities to decrypt ciphers may be susceptible to mismatched-generalization attacks: their safety mechanisms fail on at least one safety dimension, leading to unsafe responses or over-refusal. Based on these findings, we evaluate a number of pre-LLM and post-LLM safeguards and discuss their strengths and limitations. This work contributes to understanding the safety of LLM in long-tail text scenarios and provides directions for developing robust safety mechanisms.
arXiv.org Artificial Intelligence
Jun-5-2025
- Genre:
- Research Report > New Finding (0.46)
- Industry:
- Health & Medicine (1.00)
- Information Technology > Security & Privacy (1.00)
- Technology: