MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
Xinyu Fang
Neural Information Processing Systems
The advent of large vision-language models (LVLMs) has spurred research into their applications in multi-modal contexts, particularly in video understanding.
Feb-17-2026, 02:45:14 GMT