NeMo: Needle in a Montage for Video-Language Understanding

Hu, Zi-Yuan; Liang, Shuo; Zheng, Duo; Li, Yanyang; Tao, Yeyao; Huang, Shijia; Feng, Wei; Qin, Jia; Yu, Jianguang; Huang, Jing; Fang, Meng; Li, Yin; Wang, Liwei

arXiv.org Artificial Intelligence 

Inspired by the needle-in-a-haystack test widely used to evaluate LLMs, we introduce Needle in a Montage (NeMo), a novel task designed to assess the critical reasoning capabilities of VideoLLMs, including long-context recall and temporal grounding. To generate video question-answering data for this task, we develop a scalable, automated data generation pipeline that supports high-quality data synthesis. Built on this pipeline, we present NeMoBench, a video-language benchmark centered on the NeMo task. The full NeMoBench set contains 31,378 automatically generated question-answer (QA) pairs drawn from 13,486 videos whose durations range from seconds to hours. Experiments show that the pipeline reliably and automatically generates high-quality evaluation data, which allows NeMoBench to be continuously updated with the latest videos. We evaluate 20 state-of-the-art models on the benchmark and report extensive results along with key insights into their capabilities and limitations.
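The abstract lists temporal grounding among the capabilities NeMo tests but does not say how it is scored. A common way to score temporal grounding in general is the temporal intersection over union (IoU) between a predicted time span and a ground-truth time span, reported as recall at an IoU threshold. The sketch below assumes that metric purely for illustration; the function names `temporal_iou` and `recall_at_iou` and the default 0.5 threshold are hypothetical and are not NeMoBench's published evaluation code.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) time intervals in seconds (assumed metric)."""
    p_start, p_end = pred
    g_start, g_end = gt
    if p_end < p_start or g_end < g_start:
        raise ValueError("interval end must not precede its start")
    # Overlap length is zero when the intervals do not intersect.
    inter = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    union = (p_end - p_start) + (g_end - g_start) - inter
    # Two zero-length intervals have no meaningful union; score them 0.
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds, gts, threshold=0.5):
    """Fraction of ground-truth spans matched with IoU >= threshold (hypothetical helper)."""
    if len(preds) != len(gts):
        raise ValueError("need exactly one prediction per ground-truth span")
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


if __name__ == "__main__":
    print(temporal_iou((10, 20), (15, 25)))
    print(recall_at_iou([(10, 20), (0, 5)], [(12, 20), (30, 40)]))
```

For example, a predicted span of 10 to 20 s against a ground truth of 15 to 25 s overlaps by 5 s out of a 15 s union, giving an IoU of about 0.33, which would not count as a hit at the assumed 0.5 threshold.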
