VideoMultiAgents: A Multi-Agent Framework for Video Question Answering

Kugo, Noriyuki, Li, Xiang, Li, Zixin, Gupta, Ashish, Khatua, Arpandeep, Jain, Nidhish, Patel, Chaitanya, Kyuragi, Yuta, Ishii, Yasunori, Tanabiki, Masamoto, Kozuka, Kazuki, Adeli, Ehsan

May-1-2025–arXiv.org Artificial Intelligence

Video Question Answering (VQA) inherently relies on multimodal reasoning, integrating visual, temporal, and linguistic cues to achieve a deeper understanding of video content. However, many existing methods rely on feeding frame-level captions into a single model, making it difficult to adequately capture temporal and interactive contexts. T o address this limitation, we introduce VideoMultiAgents, a framework that integrates specialized agents for vision, scene graph analysis, and text processing. It enhances video understanding leveraging complementary multimodal reasoning from independently operating agents. Our approach is also supplemented with a question-guided caption generation, which produces captions that highlight objects, actions, and temporal transitions directly relevant to a given query, thus improving the answer accuracy.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

May-1-2025

arXiv.org PDF

Add feedback

Genre:
- Research Report (0.82)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning > Agents (1.00)
  - Natural Language (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (0.95)

Duplicate Docs Excel Report

Title
None found

Similar Docs Excel Report more

Title	Similarity	Source
None found