Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study
Lee, DongGeon, Jang, Joonwon, Jeong, Jihae, Yu, Hwanjo
arXiv.org Artificial Intelligence
Rapid deployment of vision-language models (VLMs) magnifies safety risks, yet most evaluations rely on artificial images. This study asks: How safe are current VLMs when confronted with meme images that ordinary users share? To investigate this question, we introduce MemeSafetyBench, a 50,430-instance benchmark pairing real meme images with both harmful and benign instructions. Using a comprehensive safety taxonomy and LLM-based instruction generation, we assess multiple VLMs across single-turn and multi-turn interactions. We investigate how real-world memes influence harmful outputs, the mitigating effects of conversational context, and the relationship between model scale and safety metrics. Our findings demonstrate that VLMs are more vulnerable to meme-based harmful prompts than to synthetic or typographic images. Memes significantly increase harmful responses and decrease refusals compared to text-only inputs. Though multi-turn interactions provide partial mitigation, elevated vulnerability persists. These results highlight the need for ecologically valid evaluations and stronger safety mechanisms. MemeSafetyBench is publicly available at https://github.com/oneonlee/Meme-Safety-Bench.
Sep-24-2025
- Country:
- North America > United States (1.00)
- Europe (1.00)
- Asia (1.00)
- Genre:
- Research Report > New Finding (1.00)
- Industry:
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine > Therapeutic Area
- Psychiatry/Psychology (0.46)
- Technology: