Benchmarking Large Language Models on CMExam - A Comprehensive Chinese Medical Exam Dataset

Neural Information Processing Systems 

The results show that GPT-4 had the best accuracy of 61.6% and a weighted F1
