One Head to Rule Them All: Amplifying LVLMSafety through a Single Critical Attention Head
–Neural Information Processing Systems
Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in tasks requiring multimodal understanding. However, recent studies indicate that LVLMs are more vulnerable than LLMs to unsafe inputs and prone to generating harmful content. Existing defense strategies primarily include fine-tuning, input sanitization, and output intervention. Although these approaches provide a certain level of protection, they tend to be resource-intensive and struggle to effectively counter sophisticated attack techniques. To tackle such issues, we propose One-head Defense (Oh Defense), a novel yet simple approach utilizing LVLMs' internal safety capabilities. Through systematic analysis of the attention mechanisms, we discover that LVLMs' safety capabilities are concentrated within specific attention heads that respond differently to safe or unsafe inputs. Further exploration reveals that a single critical attention head can effectively serve as a safety guard, providing a strong discriminative signal that amplifies the model's inherent safety capabilities. Hence, the Oh Defense requires no additional training or external modules, making it computationally efficient while effectively reactivating suppressed safety mechanisms. Extensive experiments across diverse LVLM architectures and unsafe datasets validate our approach, i.e., the Oh Defense achieves near-perfect defense success rates (> 98%) for unsafe inputs while maintaining low false positive rates (< 5%) for safe content.
Neural Information Processing Systems
Jun-16-2026, 18:35:39 GMT
- Genre:
- Research Report
- Experimental Study (1.00)
- New Finding (0.93)
- Research Report
- Industry:
- Information Technology > Security & Privacy (0.46)
- Government (0.46)
- Technology:
- Information Technology > Artificial Intelligence
- Vision (1.00)
- Natural Language > Large Language Model (1.00)
- Machine Learning
- Performance Analysis > Accuracy (1.00)
- Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence