Automating Expert-Level Medical Reasoning Evaluation of Large Language Models