Understanding and Rectifying Safety Perception Distortion in VLMs
Neural Information Processing Systems
Recent studies reveal that vision-language models (VLMs) become more susceptible to harmful requests and jailbreak attacks after integrating the vision modality, exhibiting greater vulnerability than their text-only LLM backbones. To uncover the root cause of this phenomenon, we conduct an in-depth analysis and identify a key issue: multimodal inputs introduce a modality-induced activation shift toward a "safer" direction compared to their text-only counterparts, leading VLMs to systematically overestimate the safety of harmful inputs. We refer to this issue as safety perception distortion. To mitigate such distortion, we propose Activation Shift Disentanglement and Calibration (ShiftDC), a training-free method that decomposes and calibrates the modality-induced activation shift to reduce its impact on safety.
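The idea of decomposing and calibrating an activation shift can be illustrated with a small linear-algebra sketch. This is not the authors' implementation. It assumes (hypothetically) that a "safer" direction can be estimated as the difference between mean benign and mean harmful text-only activations, and that calibration means removing the component of the shift that lies along that direction. The function names `safety_direction` and `calibrate`, and the toy 2-D activations, are illustrative only.

```python
import numpy as np

def safety_direction(harmful_acts, benign_acts):
    """Unit vector from the mean harmful activation toward the mean benign one.

    Assumption: the "safer" direction is estimated by a difference of means
    over text-only activations (one row per example).
    """
    d = benign_acts.mean(axis=0) - harmful_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def calibrate(multimodal_act, text_act, direction):
    """Remove the safety-aligned part of the modality-induced shift.

    shift = multimodal_act - text_act is split into a component along
    `direction` and a remainder; only the former is subtracted.
    """
    shift = multimodal_act - text_act
    safety_component = np.dot(shift, direction) * direction
    return multimodal_act - safety_component

# Toy 2-D activations (hypothetical values, chosen for easy arithmetic).
harmful = np.array([[0.0, 0.0], [0.0, 2.0]])
benign = np.array([[4.0, 0.0], [4.0, 2.0]])
direction = safety_direction(harmful, benign)  # points along the first axis

text_act = np.array([0.5, 1.0])        # text-only counterpart of an input
multimodal_act = np.array([2.5, 3.0])  # same input with the image attached
calibrated = calibrate(multimodal_act, text_act, direction)
```

In this toy case the shift is `[2, 2]`; the part along the estimated direction, `[2, 0]`, is removed, while the orthogonal part of the shift is left untouched. No model is retrained, which matches the training-free framing, though how the real method selects layers and estimates directions is not specified here.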
Jun-21-2026, 07:22:47 GMT
- Country:
- North America > United States (0.46)
- Europe > Austria (0.28)
- Genre:
- Research Report
- New Finding (1.00)
- Experimental Study (1.00)
- Industry:
- Information Technology > Security & Privacy (1.00)
- Law (0.68)
- Government (0.67)
- Technology:
- Information Technology > Artificial Intelligence
- Vision (1.00)
- Natural Language > Large Language Model (0.90)
- Machine Learning
- Performance Analysis > Accuracy (0.68)
- Neural Networks > Deep Learning (0.46)