Towards Automatic Evaluation for LLMs' Clinical Capabilities: Metric, Data, and Algorithm