VideoHallu: Evaluating and Mitigating Multi-modal Hallucinations on Synthetic Video Understanding

Jun-18-2026, 06:55:21 GMT–Neural Information Processing Systems

Vision-Language Models (VLMs) have achieved remarkable success in video understanding tasks. Yet, a key question remains: do they comprehend visual information, or merely learn superficial mappings between visual and textual patterns? Understanding visual cues, particularly those related to physics and common sense, is crucial for AI systems interacting with the physical world. However, existing VLM evaluations primarily rely on positive-control tests using real-world videos that resemble training distributions. While VLMs perform well on such benchmarks, it is unclear whether they grasp underlying visual and contextual signals or simply exploit visual-language correlations. To fill this gap, we propose incorporating negative-control tests, i.e., videos depicting physically impossible or logically inconsistent scenarios, and evaluating whether models can recognize these violations.

large language model, machine learning, natural language

Country:
- North America > United States (0.67)

Genre:
- Research Report
  - New Finding (1.00)
  - Experimental Study (1.00)

Industry:
- Education (0.46)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Cognitive Science > Problem Solving (0.68)
  - Natural Language
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
