Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition
Dasol Choi, Seunghyun Lee, Youngsook Song
Vision-Language Models (VLMs) have shown promise in interpreting visual content, but their reliability in safety-critical scenarios remains insufficiently explored. We introduce VERI, a diagnostic benchmark comprising 200 synthetic images (100 contrastive pairs) and an additional 50 real-world images (25 pairs) for validation. Each emergency scene is paired, via human verification, with a visually similar but safe counterpart. Using a two-stage evaluation protocol (risk identification and emergency response), we assess 17 VLMs across medical emergencies, accidents, and natural disasters. Our analysis reveals an "overreaction problem": models achieve high recall (70-100%) but suffer from low precision, misclassifying 31-96% of safe situations as dangerous. Seven safe scenarios were misclassified by all 17 models. This "better-safe-than-sorry" bias stems primarily from contextual overinterpretation, which accounts for 88-98% of errors. Both the synthetic and real-world datasets confirm these systematic patterns, challenging VLM reliability in safety-critical applications. Addressing it will require enhanced contextual reasoning in ambiguous visual situations.
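The first stage of the protocol reduces to a binary danger/safe decision over contrastive pairs, so the reported precision and recall follow directly from counting false alarms on safe scenes. Below is a minimal sketch of that scoring, assuming a hypothetical `classify_risk` wrapper around a VLM; the `Scene` fields and function names are illustrative only, not the authors' evaluation code.

```python
# Minimal sketch (not the authors' code) of stage-1 risk-identification
# scoring. `classify_risk` is a hypothetical VLM wrapper that returns
# True when the model labels an image as dangerous.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Scene:
    image_path: str
    is_emergency: bool  # ground truth: each contrastive pair has one True, one False


def evaluate_risk_identification(
    scenes: List[Scene],
    classify_risk: Callable[[str], bool],
) -> Dict[str, float]:
    """Score the binary danger/safe decision.

    High recall with low precision is the "overreaction problem": the
    model rarely misses a real emergency (few false negatives) but
    frequently flags safe look-alike scenes (many false positives).
    """
    tp = fp = fn = 0
    for scene in scenes:
        predicted_danger = classify_risk(scene.image_path)
        if predicted_danger and scene.is_emergency:
            tp += 1
        elif predicted_danger and not scene.is_emergency:
            fp += 1  # safe scene misread as an emergency: the overreaction case
        elif not predicted_danger and scene.is_emergency:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}
```

On a balanced set of contrastive pairs, a model that flags nearly everything as dangerous scores near-perfect recall while precision collapses toward 50% or below, matching the high-recall/low-precision pattern the abstract reports.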
arXiv.org Artificial Intelligence
Sep-30-2025
- Country:
  - Asia > South Korea
  - North America > United States (0.04)
- Genre:
  - Research Report > New Finding (0.46)
- Industry:
  - Health & Medicine > Consumer Health (0.46)
- Technology:
  - Information Technology > Artificial Intelligence
    - Machine Learning
      - Neural Networks > Deep Learning (0.71)
      - Performance Analysis > Accuracy (0.96)
    - Natural Language
      - Chatbot (0.71)
      - Large Language Model (1.00)
    - Representation & Reasoning (1.00)
    - Vision (1.00)