Large Language Models in the Clinic: A Comprehensive Benchmark