HourVideo: 1-Hour Video-Language Understanding

Mar-20-2026, 20:47:31 GMT–Neural Information Processing Systems

Our dataset consists of a novel task suite comprising summarization, perception, visual reasoning, and navigation tasks. Human experts significantly outperform the state-of-the-art long-context multimodal model, Gemini Pro 1.5 (85.0% vs. 37.3%), highlighting a substantial gap in multimodal capabilities.

Technology:
- Information Technology > Artificial Intelligence (0.72)