Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning
Huabin Liu, Filip Ilievski, Cees G. M. Snoek
arXiv.org Artificial Intelligence
This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the remarkable progress of large visual-language models (VLMs), there are growing concerns that they learn spurious correlations between videos and likely answers, reinforced by their black-box nature and remaining benchmarking biases. Our method explicitly grounds VQA tasks to video fragments in four steps: entailment tree construction, video-language entailment verification, tree reasoning, and dynamic tree expansion. A vital benefit of the method is its generalizability to current video- and image-based VLMs across reasoning types. To support fair evaluation, we devise a de-biasing procedure based on large language models that rewrites VQA benchmark answer sets to enforce model reasoning. Systematic experiments on existing and de-biased benchmarks highlight the impact of our method's components across benchmarks, VLMs, and reasoning types.
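The four-step pipeline sketched in the abstract can be illustrated with a toy entailment tree. The sketch below is not the paper's implementation: the node structure, the keyword-overlap `verify_leaf` stand-in for a VLM entailment check, and the min-aggregation in `reason` are all assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    statement: str                  # natural-language (sub-)claim
    children: List["Node"] = field(default_factory=list)
    score: float = 0.0              # entailment confidence in [0, 1]

def verify_leaf(statement: str, video_fragment: str) -> float:
    """Stand-in for video-language entailment verification: a real system
    would query a VLM against a video fragment; here a toy word-overlap
    ratio is used instead."""
    words = set(statement.lower().split())
    frag = set(video_fragment.lower().split())
    return len(words & frag) / max(len(words), 1)

def reason(node: Node, video_fragment: str) -> float:
    """Bottom-up tree reasoning: leaves are verified against the video;
    an internal node is entailed only as strongly as its weakest child
    (min aggregation, an assumption of this sketch)."""
    if not node.children:
        node.score = verify_leaf(node.statement, video_fragment)
    else:
        node.score = min(reason(c, video_fragment) for c in node.children)
    return node.score

# Toy example: a candidate answer decomposed into two sub-claims
# (a minimal stand-in for entailment tree construction).
tree = Node("the person pours water to make tea", children=[
    Node("the person pours water"),
    Node("the person makes tea"),
])
score = reason(tree, "a person pours water into a kettle and makes tea")
```

In a full system, low-scoring nodes would trigger the dynamic tree expansion step, decomposing them further before re-verification; that step is omitted here for brevity.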
Jan-9-2025