Understanding and Rectifying Safety Perception Distortion in VLMs

Jun-21-2026, 07:22:47 GMT–Neural Information Processing Systems

Recent studies reveal that vision-language models (VLMs) become more susceptible to harmful requests and jailbreak attacks after integrating the vision modality, exhibiting greater vulnerability than their text-only LLM backbones. To uncover the root cause of this phenomenon, we conduct an in-depth analysis and identify a key issue: multimodal inputs introduce an modality-induced activation shift toward a "safer" direction compared to their text-only counterparts, leading VLMs to systematically overestimate the safety of harmful inputs. We refer to this issue as safety perception distortion. To mitigate such distortion, we propose Activation Shift Disentanglement and Calibration (ShiftDC), a training-free method that decomposes and calibrates the modality-induced activation shift to reduce its impact on safety.

large language model, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Jun-21-2026, 07:22:47 GMT

Conferences PDF

Add feedback

Country:
- North America > United States (0.46)
- Europe > Austria (0.28)

Genre:
- Research Report
  - New Finding (1.00)
  - Experimental Study (1.00)

Industry:
- Information Technology > Security & Privacy (1.00)
- Law (0.68)
- Government (0.67)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Natural Language > Large Language Model (0.90)
  - Machine Learning
    - Performance Analysis > Accuracy (0.68)
    - Neural Networks > Deep Learning (0.46)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found