HourVideo: 1-Hour Video-Language Understanding

Mar-21-2025, 10:45:23 GMT–Neural Information Processing Systems

Our dataset consists of a novel task suite comprising summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. HourVideo includes 500 manually curated egocentric videos from the Ego4D dataset, spanning durations of 20 to 120 minutes, and features 12,976 high-quality, five-way multiple-choice questions. Benchmarking results reveal that multimodal models, including GPT-4 and LLaVA-NeXT, achieve only marginal improvements over random chance (20% for five-way questions). In stark contrast, human experts significantly outperform the state-of-the-art long-context multimodal model, Gemini Pro 1.5 (85.0% vs. 37.3%), highlighting a substantial gap in multimodal capabilities. Our benchmark, evaluation toolkit, prompts, and documentation are available at hourvideo.stanford.edu.

Country:
- North America > United States > California > Santa Clara County > Palo Alto (0.25)

Genre:
- Research Report > New Finding (0.45)

Industry:
- Information Technology > Security & Privacy (0.46)

Technology:
- Information Technology
  - Artificial Intelligence
    - Cognitive Science (0.93)
    - Machine Learning > Neural Networks
      - Deep Learning (1.00)
    - Natural Language > Large Language Model (1.00)
    - Representation & Reasoning > Spatial Reasoning (0.66)
    - Robots (0.93)
    - Vision (1.00)
  - Human Computer Interaction > Interfaces
    - Virtual Reality (0.67)
