Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
Li, Jialuo; Li, Bin; Li, Jiahao; Lu, Yan
arXiv.org Artificial Intelligence
The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video tokens. Consequently, recent research has focused on query-aware frame selection methods, which often incur significant computational overhead. This paper challenges the assumption that such complex search mechanisms are universally necessary. We first identify and validate a query typology that distinguishes between global queries and localized queries. We demonstrate that uniform sampling is both effective and efficient for global queries, whereas localized queries do require query-aware selection for optimal performance. Building on this insight, we propose DIG, a training-free frame selection framework that adapts its strategy to the query type. Specifically, DIG employs efficient uniform sampling for global queries and activates a specialized pipeline to extract query-relevant frames for localized queries. Experiments on three long-form video understanding benchmarks demonstrate that DIG consistently outperforms existing baselines and robustly improves LMM performance, even when the input frame count is scaled to 256.
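The dispatch idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the query type is determined or what the localized pipeline does, so here the query type is passed in as a label, per-frame relevance scores are assumed to be precomputed, and a simple top-k pick stands in for the "specialized pipeline". All function names are hypothetical.

```python
from typing import Sequence


def uniform_indices(num_frames: int, budget: int) -> list[int]:
    """Pick `budget` evenly spaced frame indices (the centre of each equal segment)."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    if budget >= num_frames:
        return list(range(num_frames))
    step = num_frames / budget
    return [int(step * i + step / 2) for i in range(budget)]


def top_k_indices(scores: Sequence[float], budget: int) -> list[int]:
    """Stand-in for query-aware selection (an assumption, not the paper's pipeline):
    keep the `budget` highest-scoring frames, returned in temporal order."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def select_frames(query_type: str, scores: Sequence[float], budget: int) -> list[int]:
    """Route to a selection strategy by query type.

    `query_type` is assumed to be given as "global" or "localized";
    `scores` holds one hypothetical query-relevance score per frame.
    """
    if query_type == "global":
        # Global queries: cheap uniform sampling, scores are ignored.
        return uniform_indices(len(scores), budget)
    if query_type == "localized":
        # Localized queries: spend the budget on query-relevant frames.
        return top_k_indices(scores, budget)
    raise ValueError(f"unknown query type: {query_type!r}")
```

In this sketch a global query never touches the relevance scores, which is where the efficiency gain described in the abstract would come from: the scoring step can be skipped entirely for that branch.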
Dec-4-2025
- Country:
- Asia (0.04)
- Europe > Italy
- Calabria > Catanzaro Province > Catanzaro (0.04)
- North America > United States (0.04)
- Genre:
- Research Report > New Finding (0.92)
- Technology: