DeepSeek performs better than other Large Language Models in Dental Cases

Hexian Zhang, Xinyu Yan, Yanqi Yang, Lijian Jin, Ping Yang, Junwen Wang

arXiv.org Artificial Intelligence 

Division of Epidemiology, Department of Quantitative Health Sciences, Mayo Clinic, Scottsdale, AZ 85259, USA

Correspondence: Hexian Zhang, chordzhang@connect.hku.hk, Tel: (852) 2852 0128, Fax: (852) 2548 9464

Abstract word count: 185
Total word count: 3167
Total number of tables: 2
Total number of figures: 3
Number of references: 32

Keywords: Artificial Intelligence, Deep Learning/Machine Learning, Dental Education, Electronic Dental Records, Periodontal Medicine

Abstract

Aims: Periodontology, with its wealth of structured clinical data, offers an ideal setting to evaluate the reasoning abilities of large language models (LLMs). This study aims to assess four LLMs (GPT-4o, Gemini 2.0 Flash, Copilot, and DeepSeek V3) in interpreting longitudinal periodontal case vignettes through open-ended tasks.

Materials and Methods: Thirty-four standardized longitudinal periodontal case vignettes were curated, generating 258 open-ended question-answer pairs. Each model was prompted to review case details and produce responses. Performance was evaluated using automated metrics (faithfulness, answer relevancy, readability) and blinded assessments by licensed dentists on a five-point Likert scale.

Results: DeepSeek V3 achieved the highest median faithfulness score (0.528), outperforming GPT-4o (0.457), Gemini 2.0 Flash (0.421), and Copilot (0.367).
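The abstract lists readability among the automated metrics but does not specify the formula at this point in the text. A common choice for scoring model responses is the Flesch Reading Ease formula; the sketch below is a minimal, self-contained illustration of that metric (the syllable counter is a crude vowel-group heuristic, not the exact tooling used in the study).

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels (y treated as a vowel).
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    # Flesch Reading Ease = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    # Higher scores indicate easier-to-read text.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    avg_sentence_len = len(words) / len(sentences)
    avg_syllables = sum(count_syllables(w) for w in words) / len(words)
    return 206.835 - 1.015 * avg_sentence_len - 84.6 * avg_syllables

# Short, monosyllabic text scores high; dense clinical prose scores much lower.
simple = flesch_reading_ease("The cat sat on the mat.")
dense = flesch_reading_ease(
    "Periodontology offers structured longitudinal clinical documentation."
)
```

This kind of surface-level metric complements, but does not replace, the blinded dentist ratings described in the Methods.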