NeMo: Needle in a Montage for Video-Language Understanding
Hu, Zi-Yuan, Liang, Shuo, Zheng, Duo, Li, Yanyang, Tao, Yeyao, Huang, Shijia, Feng, Wei, Qin, Jia, Yu, Jianguang, Huang, Jing, Fang, Meng, Li, Yin, Wang, Liwei
arXiv.org Artificial Intelligence
Inspired by the needle in a haystack test widely used to evaluate LLMs, we introduce a novel task of Needle in a Montage (NeMo), designed to assess VideoLLMs' critical reasoning capabilities, including long-context recall and temporal grounding. To generate video question answering data for our task, we develop a scalable automated data generation pipeline that facilitates high-quality data synthesis. Built upon the proposed pipeline, we present NeMoBench, a video-language benchmark centered on our task. Specifically, the full set of NeMoBench features 31,378 automatically generated question-answer (QA) pairs from 13,486 videos with durations ranging from seconds to hours. Experiments demonstrate that our pipeline can reliably and automatically generate high-quality evaluation data, enabling NeMoBench to be continuously updated with the latest videos. We evaluate 20 state-of-the-art models on our benchmark, providing extensive results and key insights into their capabilities and limitations.
Oct-14-2025