VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding Houlun Chen
–Neural Information Processing Systems
Specifically, we resort to large language models (LLM) and large multimodal models (LMM) with our proposed Statics and Dynamics Enhanced Captioning modules to generate diverse fine-grained captions for each video.
Neural Information Processing Systems
Feb-12-2026, 07:33:12 GMT