Towards Fine-Grained Video Question Answering

Dai, Wei, Luo, Alan, Durante, Zane, Dash, Debadutta, Milstein, Arnold, Schulman, Kevin, Adeli, Ehsan, Fei-Fei, Li

Mar-9-2025–arXiv.org Artificial Intelligence

In the rapidly evolving domain of video understanding, Video Question Answering (VideoQA) remains a focal point. However, existing datasets lack temporal and spatial granularity, which in turn limits the capabilities of current VideoQA methods. This paper introduces the Multi-Object Multi-Actor Question Answering (MOMA-QA) dataset, which is designed to address these shortcomings by emphasizing temporal localization, spatial relationship reasoning, and entity-centric queries. With ground-truth scene graphs and temporal interval annotations, MOMA-QA is well suited to developing models for fine-grained video understanding. Furthermore, we present a novel video-language model, SGVLM, which incorporates a scene graph predictor, an efficient frame retriever, and a pre-trained large language model for temporal localization and fine-grained relationship understanding. Evaluations on MOMA-QA and other public datasets demonstrate the superior performance of our model, setting new benchmarks for VideoQA.

large language model, machine learning, question answering

Country:
- North America > United States > Hawaii (0.14)

Genre:
- Research Report > New Finding (0.46)

Industry:
- Education (0.46)
- Health & Medicine (0.68)
- Leisure & Entertainment > Sports (0.47)

Technology:
- Information Technology > Artificial Intelligence
  - Machine Learning > Neural Networks
    - Deep Learning (0.93)
  - Natural Language
    - Large Language Model (0.68)
    - Question Answering (1.00)