CinePile: A Long Video Question Answering Dataset and Benchmark

Ruchit Rawal, Khalid Saifullah, Ronen Basri, David Jacobs, Gowthami Somepalli, Tom Goldstein

arXiv.org Artificial Intelligence 

Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges, as many tasks derived from these datasets can be solved by analyzing just one or a few random frames from a video. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This paper details our approach for creating a question-answer dataset, using advanced LLMs with a human in the loop and building upon human-generated raw data. The resulting dataset comprises 305,000 multiple-choice questions (MCQs) covering a range of visual and multimodal aspects, including temporal comprehension, understanding of human-object interactions, and reasoning about events or actions within a scene. Additionally, we evaluate recent video-centric LLMs, both open-source and proprietary, on the test split of our dataset. The findings reveal that even state-of-the-art video-centric LLMs significantly lag behind human performance on these tasks, highlighting the complexity and challenge inherent in video understanding.
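The evaluation described above amounts to scoring each model's multiple-choice answers against the benchmark's answer key. Below is a minimal sketch of such MCQ accuracy scoring; the field names (`question`, `choices`, `answer_index`) and the `model_answer` callable are hypothetical placeholders under assumed schema, not the paper's released evaluation code or dataset format.

```python
# Minimal sketch of MCQ accuracy scoring for a benchmark like CinePile.
# Field names (question, choices, answer_index) and the model_answer()
# callable are hypothetical placeholders, not the paper's released API.
from typing import Callable, Iterable

LETTERS = "ABCDE"

def format_prompt(question: str, choices: list[str]) -> str:
    """Render one multiple-choice question with lettered options."""
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(choices))
    return f"{question}\n{options}\nAnswer with a single letter."

def score_mcq(examples: Iterable[dict], model_answer: Callable[[str], str]) -> float:
    """Return the fraction of examples the model answers correctly."""
    correct = total = 0
    for ex in examples:
        prompt = format_prompt(ex["question"], ex["choices"])
        pred = model_answer(prompt).strip().upper()[:1]  # take first letter of the reply
        if pred == LETTERS[ex["answer_index"]]:
            correct += 1
        total += 1
    return correct / max(total, 1)

if __name__ == "__main__":
    demo = [{"question": "What does the character pick up after the argument?",
             "choices": ["A phone", "A suitcase", "Keys", "A letter"],
             "answer_index": 1}]
    # A trivial stand-in "model" that always answers "B".
    print(score_mcq(demo, lambda prompt: "B"))
```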
