VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding Houlun Chen

Neural Information Processing Systems 

Specifically, we resort to large language models (LLM) and large multimodal models (LMM) with our proposed Statics and Dynamics Enhanced Captioning modules to generate diverse fine-grained captions for each video.

Similar Docs  Excel Report  more

TitleSimilaritySource
None found