ConfProBench: A Confidence Evaluation Benchmark for MLLM-Based Process Judges

Zhou, Yue; Chang, Yi; Wu, Yuan

arXiv.org Artificial Intelligence 

Reasoning is a critical capability that multimodal large language models (MLLMs) need to solve complex multimodal tasks, and judging the correctness of individual reasoning steps is key to improving it. MLLM-based process judges (MPJs) are now widely used to judge step correctness in multimodal reasoning tasks, so evaluating MPJs is important for identifying their limitations and guiding future improvements. However, existing benchmarks for MPJs focus primarily on capabilities such as step correctness classification and reasoning process search, and overlook a critical dimension: whether the step-level confidence scores that MPJs produce are reliable. To fill this gap, we propose ConfProBench, the first comprehensive benchmark designed to systematically evaluate the reliability of step-level confidence scores generated by MPJs. The benchmark constructs three types of adversarially perturbed reasoning steps (Synonym Substitution, Syntactic Transformation, and Image Perturbation) to evaluate the robustness of MPJs' confidence under perturbation. Furthermore, we propose three novel evaluation metrics, Confidence Robustness Score (CRS), Confidence Sensitivity Score (CSS), and Confidence Calibration Score (CCS), which capture three complementary aspects of MPJs' confidence: robustness, sensitivity, and calibration. We evaluate 14 state-of-the-art MLLMs, including both proprietary and open-source models. Through extensive experiments, we reveal limitations in the confidence performance of existing MPJs and provide competitive baselines, paving the way for future research in this field.
