Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding

Guo, Weiyu, Chen, Ziyang, Wang, Shaoguang, He, Jianxiang, Xu, Yijie, Ye, Jinhui, Sun, Ying, Xiong, Hui

Mar-17-2025–arXiv.org Artificial Intelligence

Understanding long video content is a complex endeavor that often relies on densely sampled frame captions or end-to-end feature selectors, yet these techniques commonly overlook the logical relationships between textual queries and visual elements. In practice, computational constraints necessitate coarse frame subsampling, a challenge analogous to "finding a needle in a haystack." To address this issue, we introduce a semantics-driven search framework that reformulates keyframe selection under the paradigm of Visual Semantic-Logical Search. Specifically, we systematically define four fundamental logical dependencies: 1) spatial co-occurrence, 2) temporal proximity, 3) attribute dependency, and 4) causal order. These relations dynamically update frame sampling distributions through an iterative refinement process, enabling context-aware identification of semantically critical frames tailored to specific query requirements. Our method establishes new state-of-the-art (SOTA) performance in keyframe selection metrics on the manually annotated benchmark. Furthermore, when applied to downstream video question-answering tasks, the proposed approach achieves the largest performance gains over existing methods on LongVideoBench and Video-MME, validating its effectiveness in bridging the logical gap between textual queries and visual-temporal reasoning. The code will be publicly available.
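The iterative refinement described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the multiplicative boost rule, the `spread` and `window` parameters, and the precomputed `detections` dictionary (standing in for an object detector) are all assumptions made for illustration. Only two of the four relations, spatial co-occurrence and temporal proximity, are modeled; attribute dependency and causal order are omitted.

```python
import random
from typing import Callable, Dict, List, Sequence, Set

Detections = Dict[int, Set[str]]          # frame index -> labels seen in that frame
Relation = Callable[[Detections, int], bool]


def normalize(weights: Sequence[float]) -> List[float]:
    """Scale weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def spatial_cooccurrence(a: str, b: str) -> Relation:
    """Relation that holds when both objects appear in the same frame."""
    def check(detections: Detections, frame: int) -> bool:
        objs = detections.get(frame, set())
        return a in objs and b in objs
    return check


def temporal_proximity(a: str, b: str, window: int) -> Relation:
    """Relation that holds when `a` is in the frame and `b` is within `window` frames."""
    def check(detections: Detections, frame: int) -> bool:
        if a not in detections.get(frame, set()):
            return False
        return any(b in detections.get(f, set())
                   for f in range(frame - window, frame + window + 1))
    return check


def update_distribution(weights: Sequence[float], detections: Detections,
                        relations: Sequence[Relation], inspected: Sequence[int],
                        boost: float = 1.0, spread: int = 1) -> List[float]:
    """Boost inspected frames (and their neighbors) that satisfy relations."""
    new = list(weights)
    for frame in inspected:
        score = sum(1 for rel in relations if rel(detections, frame))
        if score == 0:
            continue
        lo = max(0, frame - spread)
        hi = min(len(new) - 1, frame + spread)
        for f in range(lo, hi + 1):
            new[f] *= 1.0 + boost * score   # assumed update rule, for illustration
    return normalize(new)


def search(num_frames: int, detections: Detections, relations: Sequence[Relation],
           rounds: int = 3, samples_per_round: int = 4, seed: int = 0) -> List[float]:
    """Alternate between sampling frames and refining the sampling distribution."""
    rng = random.Random(seed)
    weights = normalize([1.0] * num_frames)
    for _ in range(rounds):
        inspected = rng.choices(range(num_frames), weights=weights,
                                k=samples_per_round)
        weights = update_distribution(weights, detections, relations, inspected)
    return weights
```

Calling `search(10, {4: {"person", "cup"}}, [spatial_cooccurrence("person", "cup")])` returns a distribution over the ten frames; whenever frame 4 is sampled, probability mass shifts toward it and its neighbors, so later rounds are more likely to inspect that region.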

large language model, machine learning, relation

Country:
- Asia > China > Shanghai > Shanghai (0.04)

Genre:
- Research Report (1.00)

Technology:
- Information Technology > Artificial Intelligence
  - Vision (1.00)
  - Cognitive Science (0.93)
  - Representation & Reasoning
    - Temporal Reasoning (0.68)
    - Spatial Reasoning (0.46)
    - Search (0.46)
  - Natural Language
    - Large Language Model (0.71)
    - Question Answering (0.67)
    - Text Processing (0.46)
  - Machine Learning > Neural Networks
    - Deep Learning (0.47)
