SafeR-CLIP: Mitigating NSFW Content in Vision-Language Models While Preserving Pre-Trained Knowledge
Adeel Yousaf, Joseph Fioresi, James Beetham, Amrit Singh Bedi, Mubarak Shah
arXiv.org Artificial Intelligence
Improving the safety of vision-language models like CLIP via fine-tuning often comes at a steep price, causing significant drops in their generalization performance. We find this trade-off stems from rigid alignment strategies that force unsafe concepts toward single, predefined safe targets, disrupting the model's learned semantic structure. To address this, we propose a proximity-aware approach: redirecting unsafe concepts to their semantically closest safe alternatives to minimize representational change. We introduce SafeR-CLIP, a fine-tuning framework that applies this principle of minimal intervention. SafeR-CLIP successfully reconciles safety and performance, recovering up to 8.0% in zero-shot accuracy over prior methods while maintaining robust safety. To support more rigorous evaluation, we also contribute NSFW-Caps, a new benchmark of 1,000 highly aligned pairs for testing safety under distributional shift. Our work shows that respecting the geometry of pretrained representations is key to achieving safety without sacrificing performance.
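The proximity-aware redirection the abstract describes can be illustrated with a minimal sketch: for each unsafe concept embedding, pick the semantically closest safe embedding (by cosine similarity) as its fine-tuning target, rather than forcing all unsafe concepts toward one predefined safe target. All function names and the toy embeddings below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def nearest_safe_targets(unsafe_emb: np.ndarray, safe_emb: np.ndarray) -> np.ndarray:
    """For each unsafe concept embedding, return the index of the
    semantically closest safe embedding under cosine similarity.
    This is the 'minimal intervention' target selection: redirect
    each unsafe concept to its nearest safe neighbor, not to a
    single fixed safe target. (Illustrative sketch, not the paper's code.)"""
    u = unsafe_emb / np.linalg.norm(unsafe_emb, axis=1, keepdims=True)
    s = safe_emb / np.linalg.norm(safe_emb, axis=1, keepdims=True)
    sims = u @ s.T                 # (num_unsafe, num_safe) cosine similarities
    return sims.argmax(axis=1)     # nearest safe concept per unsafe concept

# Toy example: 2 unsafe concepts, 3 safe concepts in a 4-d embedding space.
unsafe = np.array([[1.0, 0.1, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.2]])
safe = np.array([[0.9, 0.2, 0.0, 0.0],    # near unsafe[0]
                 [0.0, 1.0, 0.0, 0.0],
                 [0.1, 0.0, 1.0, 0.0]])   # near unsafe[1]
targets = nearest_safe_targets(unsafe, safe)  # → array([0, 2])
```

Because each redirection target is already close in embedding space, the fine-tuning update needed to realign the unsafe concept is small, which is the intuition behind why this preserves the pretrained semantic structure.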
Nov-24-2025