MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding

Xinyu Fang

Neural Information Processing Systems 

The advent of large vision-language models (LVLMs) has spurred research into their applications in multi-modal contexts, particularly in video understanding.
