Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark