HourVideo: 1-Hour Video-Language Understanding

Neural Information Processing Systems 

Our dataset consists of a novel task suite comprising summarization, perception, visual reasoning, and navigation tasks. Human experts significantly outperform the state-of-the-art long-context multimodal model, Gemini Pro 1.5 (85.0% vs. 37.3%), highlighting a substantial gap in multimodal capabilities.