A Benchmark Suite for Reasoning-Across-Time in Videos

Jr-Jen Chen, Yu-Chien Liao

Oct-9-2025, 22:46:23 GMT–Neural Information Processing Systems

Reasoning across time in videos, which requires an advanced understanding of cause-and-effect relationships across video segments, poses significant challenges even to frontier multimodal large language models. To facilitate evaluation of this ability, we develop an automated pipeline for generating temporal reasoning question-answer pairs, significantly reducing the need for labor-intensive manual annotation. Our benchmark includes 921 validation samples and 2,143 test samples, each manually curated for accuracy and relevance. Evaluation results show that while frontier large language models outperform academic models, they still lag behind human performance by a 14.3% accuracy gap. Additionally, our pipeline creates a training dataset of 9,695 machine-generated samples without manual effort, which empirical studies suggest can enhance across-time reasoning via fine-tuning.
