VideoExplorer: Think With Videos For Agentic Long-Video Understanding

Yuan, Huaying, Liu, Zheng, Zhou, Junjie, Qian, Hongjin, Shu, Yan, Sebe, Nicu, Wen, Ji-Rong, Dou, Zhicheng

Nov-4-2025–arXiv.org Artificial Intelligence

Long-video understanding (LVU) is a challenging problem in computer vision. Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or depend on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration. In this paper, we propose VideoExplorer, a framework grounded in the principle of "thinking with video", which naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until reaching the final answer, enabling faithful, efficient, and interpretable reasoning. To address the lack of LVU training resources, we construct a long-video reasoning dataset using difficulty-adaptive sampling to ensure high-quality trajectories on complex tasks. Building on this dataset, we design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization, encouraging adaptive temporal grounding and iterative information integration guided by downstream rewards. Extensive evaluations on popular long-video understanding and reasoning benchmarks demonstrate VideoExplorer's significant advantage over existing baselines, highlighting its robustness, adaptability, and efficiency.
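The iterative process described in the abstract (formulate a sub-question, locate the relevant moment, perceive that segment, repeat until an answer is reached) can be pictured as a simple control loop. The sketch below is illustrative only and is not the authors' implementation: the names `explore`, `plan`, `ground`, `perceive`, `Step`, `Trajectory`, and the `max_steps` budget are all hypothetical, and the three callables stand in for model components that the abstract does not specify. The toy functions at the end exist only to make the loop runnable.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Segment = Tuple[float, float]  # hypothetical (start_sec, end_sec) span in the video


@dataclass
class Step:
    sub_question: str
    segment: Segment
    observation: str


@dataclass
class Trajectory:
    question: str
    steps: List[Step] = field(default_factory=list)
    answer: Optional[str] = None


def explore(
    question: str,
    plan: Callable[[Trajectory], Tuple[Optional[str], Optional[str]]],
    ground: Callable[[str], Segment],
    perceive: Callable[[Segment, str], str],
    max_steps: int = 8,  # assumed step budget, not taken from the paper
) -> Trajectory:
    """Run a plan -> ground -> perceive loop until the planner returns an answer."""
    traj = Trajectory(question)
    for _ in range(max_steps):
        # The planner sees the trajectory so far and returns either
        # (sub_question, None) to keep exploring or (None, answer) to stop.
        sub_question, answer = plan(traj)
        if answer is not None:
            traj.answer = answer
            break
        segment = ground(sub_question)                 # temporal grounding
        observation = perceive(segment, sub_question)  # task-oriented perception
        traj.steps.append(Step(sub_question, segment, observation))
    return traj


# Toy stand-ins so the loop can be exercised without any model.
def toy_plan(traj: Trajectory):
    if traj.steps:  # answer as soon as one observation has been gathered
        return None, traj.steps[-1].observation
    return "When does the event happen?", None


def toy_ground(sub_question: str) -> Segment:
    return (120.0, 150.0)


def toy_perceive(segment: Segment, sub_question: str) -> str:
    return f"observed {segment[0]:.0f}-{segment[1]:.0f}s"


result = explore("What happens?", toy_plan, toy_ground, toy_perceive)
print(result.answer, len(result.steps))
```

Keeping the full trajectory (sub-questions, segments, observations) rather than only the final answer is what would make such a loop inspectable step by step, and it is the kind of record a trajectory-level training signal would need; how VideoExplorer actually represents trajectories is not stated in the abstract.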

large language model, machine learning, natural language

Country:
- Asia > China (0.46)

Genre:
- Research Report (0.66)

Technology:
- Information Technology > Artificial Intelligence
  - Vision > Video Understanding (1.00)
  - Natural Language > Large Language Model (1.00)
  - Representation & Reasoning > Agents (0.68)
  - Machine Learning > Neural Networks
    - Deep Learning (0.48)
