HourVideo: 1-Hour Video-Language Understanding
Neural Information Processing Systems
HourVideo introduces a novel task suite spanning summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. The benchmark comprises 500 manually curated egocentric videos from the Ego4D dataset, each 20 to 120 minutes long, paired with 12,976 high-quality, five-way multiple-choice questions. Benchmarking results reveal that multimodal models, including GPT-4 and LLaVA-NeXT, achieve only marginal improvements over random chance. In stark contrast, human experts significantly outperform the state-of-the-art long-context multimodal model, Gemini Pro 1.5 (85.0% vs. 37.3%), highlighting a substantial gap in multimodal capabilities. Our benchmark, evaluation toolkit, prompts, and documentation are available at hourvideo.stanford.edu.
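Since every question is five-way multiple choice, evaluation reduces to simple accuracy, and random guessing converges to a 20% floor, which is the baseline the reported model scores are compared against. The sketch below illustrates that scoring logic; it is a minimal, hypothetical example and not the official hourvideo.stanford.edu toolkit, and the gold and predicted answers are made-up placeholders.

```python
# Minimal sketch (not the official HourVideo toolkit): scoring five-way
# multiple-choice predictions and comparing against the random-chance baseline.
import random

def accuracy(predictions, answers):
    """Fraction of questions where the predicted option matches the gold answer."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical gold answers for a handful of questions (options A-E).
gold = ["B", "E", "A", "C", "D", "B", "A", "E"]

# Random guessing over five options converges to 1/5 = 20% accuracy.
random_preds = [random.choice("ABCDE") for _ in gold]
print(f"random baseline: {accuracy(random_preds, gold):.1%}")

# Model predictions would be scored the same way.
model_preds = ["B", "A", "A", "C", "E", "B", "D", "E"]
print(f"model accuracy:  {accuracy(model_preds, gold):.1%}")
```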