Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Hu, Kairui; Wu, Penghao; Pu, Fanyi; Xiao, Wang; Zhang, Yuanhan; Yue, Xiang; Li, Bo; Liu, Ziwei
arXiv.org Artificial Intelligence
Humans acquire knowledge through three cognitive stages: perceiving information, comprehending knowledge, and adapting knowledge to solve novel problems. Videos serve as an effective medium for this learning process, facilitating a progression through these cognitive stages. However, existing video benchmarks fail to systematically evaluate the knowledge acquisition capabilities of Large Multimodal Models (LMMs). To address this gap, we introduce Video-MMMU, a multi-modal, multi-disciplinary benchmark designed to assess LMMs' ability to acquire and utilize knowledge from videos. Video-MMMU features a curated collection of 300 expert-level videos and 900 human-annotated questions across six disciplines, evaluating knowledge acquisition through stage-aligned question-answer pairs: Perception, Comprehension, and Adaptation. A proposed knowledge gain metric, Δknowledge, quantifies improvement in performance after video viewing. Evaluation of LMMs reveals a steep decline in performance as cognitive demands increase and highlights a significant gap between human and model knowledge acquisition, underscoring the need for methods to enhance LMMs' capability to learn and adapt from videos.
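The abstract describes the knowledge gain metric only as quantifying the improvement in performance after video viewing, and does not give its formula. As a rough illustration, the sketch below assumes a normalized-gain form: the accuracy improvement divided by the headroom that remained before viewing. The function name knowledge_gain, its parameters, and the normalization itself are assumptions made for this example, not details taken from the abstract.

```python
def knowledge_gain(acc_before: float, acc_after: float) -> float:
    """Return an assumed normalized knowledge gain, in percent.

    Both arguments are accuracies expressed as percentages in [0, 100]:
    acc_before is measured without the video, acc_after with it.
    The normalization by remaining headroom is an assumption of this
    sketch; the abstract only says the metric measures improvement.
    """
    for name, value in (("acc_before", acc_before), ("acc_after", acc_after)):
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} must be in [0, 100], got {value}")
    if acc_before == 100.0:
        # No headroom left, so a normalized gain is undefined.
        raise ValueError("acc_before is already 100; gain is undefined")
    # Improvement divided by the room that was left to improve.
    return (acc_after - acc_before) / (100.0 - acc_before) * 100.0


if __name__ == "__main__":
    # Hypothetical accuracies, chosen only to exercise the function.
    print(knowledge_gain(40.0, 55.0))
    print(knowledge_gain(50.0, 40.0))
```

Under this assumed form, a negative value means the model answered worse after watching the video than before, and dividing by the headroom keeps models with different starting accuracies comparable.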
Jan-23-2025
- Country:
- North America > Mexico > Mexico City (0.14)
- Genre:
- Instructional Material > Course Syllabus & Notes (0.47)
- Research Report (0.63)
- Industry:
- Education (1.00)
- Health & Medicine (1.00)
- Technology: