FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding
Chongjun Tu, Lin Zhang, Pengtao Chen, Peng Ye, Xianfang Zeng, Wei Cheng, Gang Yu, Tao Chen
–arXiv.org Artificial Intelligence
Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in video content understanding but still struggle with fine-grained motion comprehension. To comprehensively assess the motion understanding ability of existing MLLMs, we introduce FAVOR-Bench, comprising 1,776 videos with structured manual annotations of various motions. Our benchmark includes both close-ended and open-ended tasks. For close-ended evaluation, we carefully design 8,184 multiple-choice question-answer pairs spanning six distinct sub-tasks. For open-ended evaluation, we develop both a novel cost-efficient LLM-free caption assessment method and a GPT-assisted one, where the former enhances benchmarking interpretability and reproducibility. Comprehensive experiments with 21 state-of-the-art MLLMs reveal significant limitations in their ability to comprehend and describe detailed temporal dynamics of video motion. To alleviate this limitation, we further build FAVOR-Train, a dataset of 17,152 videos with fine-grained motion annotations. Finetuning Qwen2.5-VL on FAVOR-Train yields consistent improvements on the motion-related tasks of TVBench, MotionBench, and our FAVOR-Bench. These results demonstrate that FAVOR-Bench and FAVOR-Train provide valuable tools for the community to develop more powerful video understanding models. Project page: \href{https://favor-bench.github.io/}{https://favor-bench.github.io/}.
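The abstract does not specify how the LLM-free caption assessment works; a plausible minimal sketch (purely illustrative, not the authors' actual metric) is to match motion phrases extracted from a generated caption against ground-truth motion annotations via token overlap and report precision/recall/F1:

```python
# Hypothetical sketch of an LLM-free caption-assessment score, assuming
# motion phrases have already been extracted from captions and annotations.
# This is NOT the FAVOR-Bench metric, only an illustration of the idea.

def normalize(phrase: str) -> set:
    """Lowercase and split a motion phrase into a word set."""
    return set(phrase.lower().split())

def phrase_match(pred: str, gold: str, threshold: float = 0.5) -> bool:
    """Two phrases match if their token Jaccard overlap meets the threshold."""
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        return False
    return len(p & g) / len(p | g) >= threshold

def motion_f1(pred_phrases, gold_phrases, threshold: float = 0.5):
    """Greedy one-to-one matching between predicted and gold motion phrases."""
    unmatched_gold = list(gold_phrases)
    tp = 0
    for pred in pred_phrases:
        for gold in unmatched_gold:
            if phrase_match(pred, gold, threshold):
                unmatched_gold.remove(gold)  # each gold phrase used once
                tp += 1
                break
    precision = tp / len(pred_phrases) if pred_phrases else 0.0
    recall = tp / len(gold_phrases) if gold_phrases else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: one correct motion, one hallucinated, one missed.
pred = ["the man raises his left arm", "she turns around quickly"]
gold = ["man raises left arm", "woman sits down"]
precision, recall, f1 = motion_f1(pred, gold)
```

Because the score is deterministic and uses no model calls, it is reproducible and interpretable, which is the property the abstract attributes to the LLM-free method.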
Mar-19-2025