Aligning MLLM Benchmark With Human Preferences via Structural Equation Modeling

Shengwu Xiong, Tianyu Zou, Cong Wang, Xuelong Li

arXiv.org Artificial Intelligence 

Abstract--Evaluating multimodal large language models (MLLMs) remains a fundamental challenge due to a lack of structured, interpretable, and theoretically grounded benchmark designs. Existing benchmarks often adopt heuristic task groupings with unclear cognitive targets, resulting in overlapping abilities, redundant indicators, and limited diagnostic power. To address this, we propose a novel framework for aligning MLLM benchmarks based on structural equation modeling, which analyzes and quantifies the internal validity, dimensional separability, and contribution of benchmark components. Motivated by the observed limitations of current designs, we further introduce a novel capability hierarchy grounded in Piaget's theory of cognitive development, dividing MLLM abilities into three hierarchical layers: Perception, Memory, and Reasoning.

The rapid advancements in the field of multimodal learning have been driven by the emergence of increasingly powerful and versatile Multimodal Large Language Models (MLLMs) [1]-[3].

This work was supported in part by the National Key Research and Development Program of China under Grant No. 2022ZD0160604, in part by the National Natural Science Foundation of China under Grant 62476219, in part by the National Key R&D Program of Shanxi under Grant 2024CY2-GJHX-54, in part by the Young Talent Fund of Association for Science and Technology in Shaanxi, China under Grant 20230140, and in part by the Fundamental Funds for the Central Universities. Tianyu Zou is with the School of Computer and Artificial Intelligence, Wuhan University of Technology, Wuhan 430070, China, also with Sanya Science and Education Innovation Park, Wuhan University of Technology, Sanya 572000, China, and also with the Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China.
Cong Wang is with the School of Mathematics and Statistics, Northwestern Polytechnical University, Xi'an 710129, China, and also with Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China. Xuelong Li is with the Institute of Artificial Intelligence (TeleAI) of China Telecom and also with Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China.

As MLLMs continue to evolve [10], [11], comprehensive evaluation frameworks become increasingly critical for assessing their reasoning abilities, multimodal understanding, and generalization performance [12], [13].