Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark

Kim, Junsu; Kim, Naeun; Lee, Jaeho; Park, Incheol; Han, Dongyoon; Baek, Seungryul

arXiv.org Artificial Intelligence 

The reasoning-based pose estimation (RPE) benchmark has emerged as a widely adopted evaluation standard for pose-aware multimodal large language models (MLLMs). Despite its significance, we identified critical reproducibility and benchmark-quality issues that hinder fair and consistent quantitative evaluation. Most notably, the benchmark uses image indices that differ from those of the original 3DPW dataset, forcing researchers into a tedious and error-prone manual matching process to obtain accurate ground-truth (GT) annotations for quantitative metrics (e.g., MPJPE, PA-MPJPE). Furthermore, our analysis reveals several inherent benchmark-quality limitations, including significant image redundancy, scenario imbalance, overly simplistic poses, and ambiguous textual descriptions, which collectively undermine reliable evaluation across diverse scenarios. To alleviate manual effort and enhance reproducibility, we carefully refined the GT annotations through meticulous visual matching and publicly release them as an open-source resource, thereby promoting consistent quantitative evaluation and facilitating future advancements in human pose-aware multimodal reasoning.
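
Since the quantitative evaluation hinges on MPJPE and PA-MPJPE, a minimal sketch of the two metrics shows what the refined GT annotations feed into. This is an illustration under assumptions, not the benchmark's released evaluation code: the function names and the (J, 3) joint-array convention (3D joint positions, typically in millimeters) are hypothetical.

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Per-Joint Position Error: mean Euclidean distance between
    predicted and ground-truth 3D joints, in the input units."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Procrustes-Aligned MPJPE: remove global rotation, translation,
    and scale via a similarity alignment before measuring the error."""
    # Center both (J, 3) joint sets at the origin.
    x = pred - pred.mean(axis=0)
    y = gt - gt.mean(axis=0)
    # Optimal rotation from the SVD of the cross-covariance matrix
    # (orthogonal Procrustes problem).
    u, s, vt = np.linalg.svd(x.T @ y)
    r = u @ vt
    if np.linalg.det(r) < 0:          # guard against a reflection
        u[:, -1] *= -1
        s[-1] *= -1
        r = u @ vt
    scale = s.sum() / (x ** 2).sum()  # optimal isotropic scale
    aligned = scale * x @ r + gt.mean(axis=0)
    return mpjpe(aligned, gt)
```

Because pa_mpjpe is invariant to any global similarity transform of the prediction, it isolates articulated-pose error from camera and root-placement error, which is why the two metrics are reported together.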