VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding
–Neural Information Processing Systems
Vision-Language Models (VLMs) have achieved remarkable success in video understanding tasks. Yet, a key question remains: do they comprehend visual information, or merely learn superficial mappings between visual and textual patterns? Understanding visual cues, particularly those related to physics and common sense, is crucial for AI systems interacting with the physical world. However, existing VLM evaluations primarily rely on positivecontrol tests using real-world videos that resemble training distributions. While VLMs perform well on such benchmarks, it is unclear whether they grasp underlying visual and contextual signals or simply exploit visual-language correlations. To fill this gap, we propose incorporating negative-control tests, i.e., videos depicting physically impossible or logically inconsistent scenarios, and evaluating whether models can recognize these violations.
Neural Information Processing Systems
Jun-18-2026, 06:55:21 GMT
- Country:
- North America > United States (0.67)
- Genre:
- Research Report
- New Finding (1.00)
- Experimental Study (1.00)
- Research Report
- Industry:
- Education (0.46)
- Technology:
- Information Technology > Artificial Intelligence
- Vision (1.00)
- Cognitive Science > Problem Solving (0.68)
- Natural Language
- Large Language Model (1.00)
- Chatbot (1.00)
- Machine Learning > Neural Networks
- Deep Learning (1.00)
- Information Technology > Artificial Intelligence