TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning

Sanders, Kate, Weir, Nathaniel, Van Durme, Benjamin

arXiv.org Artificial Intelligence 

It is challenging to perform question-answering over complex, multimodal content such as television clips. This is in part because current video-language models rely on single-modality reasoning, perform poorly on long inputs, and lack interpretability. We propose TV-TREES, the first multimodal entailment tree generator. TV-TREES is an approach to video understanding that promotes interpretable joint-modality reasoning by producing trees of entailment relationships between simple premises directly entailed by the videos and higher-level conclusions. We then introduce the task of multimodal entailment tree generation to evaluate the reasoning quality of such methods. Our method's experimental results on the challenging TVQA dataset demonstrate interpretable, state-of-the-art zero-shot performance on full video clips, illustrating a best-of-both-worlds contrast to black-box methods.
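As a rough illustration of the entailment-tree structure the abstract describes, the sketch below models a tree whose leaves are simple premises grounded in a single modality (visual or dialogue) and whose internal nodes are higher-level conclusions entailed by their children. The class and field names (`EntailmentNode`, `modality`, `render`) are hypothetical and not taken from the paper; this is a minimal sketch under those assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EntailmentNode:
    """A node in a hypothetical multimodal entailment tree.

    Leaves hold simple premises grounded directly in the video
    (e.g. a description of a frame or a line of dialogue);
    internal nodes hold conclusions entailed by their children.
    """
    statement: str
    modality: Optional[str] = None  # "visual", "dialogue", or None for derived conclusions
    children: List["EntailmentNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children

    def render(self, depth: int = 0) -> str:
        """Pretty-print the tree, one statement per line, indented by depth."""
        tag = f"[{self.modality}] " if self.modality else ""
        lines = ["  " * depth + tag + self.statement]
        for child in self.children:
            lines.append(child.render(depth + 1))
        return "\n".join(lines)


# Illustrative (invented) example: supporting an answer about a TV clip.
tree = EntailmentNode(
    statement="The character is upset because someone took their seat.",
    children=[
        EntailmentNode("A visitor is sitting on the couch.", modality="visual"),
        EntailmentNode('The character says, "You\'re in my spot."', modality="dialogue"),
    ],
)

if __name__ == "__main__":
    print(tree.render())
```

Printing the tree shows the conclusion at the root with its modality-tagged supporting premises indented beneath it, which is the interpretability property the abstract emphasizes: every answer can be traced back to evidence drawn directly from the video.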
