Can OpenAI o1 Reason Well in Ophthalmology? A 6,990-Question Head-to-Head Evaluation Study
Srinivasan, Sahana; Ai, Xuguang; Zou, Minjie; Zou, Ke; Kim, Hyunjae; Lo, Thaddaeus Wai Soon; Pushpanathan, Krithi; Kong, Yiming; Li, Anran; Singer, Maxwell; Jin, Kai; Antaki, Fares; Chen, David Ziyou; Liu, Dianbo; Adelman, Ron A.; Chen, Qingyu; Tham, Yih Chung
Question: How does OpenAI o1 perform, and how well does it reason, compared with other large language models on ophthalmology-specific questions?
Findings: This study evaluated OpenAI o1 and five other LLMs on 6,990 ophthalmology questions from MedMCQA. o1 achieved the highest accuracy (0.88) and macro-F1 score but ranked only third in reasoning capability based on text-generation metrics. Across subtopics, o1 ranked first in "Lens" and "Glaucoma" but second to GPT-4o in "Corneal and External Diseases", "Vitreous and Retina", and "Oculoplastic and Orbital Diseases". Subgroup analyses showed that o1 performed better on queries with longer ground-truth explanations.
Meaning: o1's reasoning enhancements may not fully extend to ophthalmology, underscoring the need for domain-specific refinements to optimize performance in specialized fields such as ophthalmology.
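As a minimal sketch of the headline metrics, the snippet below computes accuracy and macro-F1 over multiple-choice answers with scikit-learn; the answer labels are hypothetical placeholders, not data from the study:

```python
# Hedged sketch: accuracy and macro-F1 for MedMCQA-style multiple-choice
# evaluation. y_true and y_pred are made-up placeholder answers (A-D),
# not results from the paper.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["A", "C", "B", "D", "A", "B"]  # hypothetical ground-truth answers
y_pred = ["A", "C", "B", "A", "A", "C"]  # hypothetical model predictions

accuracy = accuracy_score(y_true, y_pred)
# Macro-F1 averages per-option F1 scores, weighting each answer choice
# equally regardless of how often it occurs in the question set.
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(f"accuracy: {accuracy:.2f}, macro-F1: {macro_f1:.2f}")
```

Macro averaging is the natural choice here because it prevents frequently occurring answer options from dominating the score, which matters when option frequencies are imbalanced across 6,990 questions.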
arXiv.org Artificial Intelligence
Jan-19-2025
- Country:
  - Asia > China > Zhejiang Province (0.14)
  - North America > United States (0.28)
- Genre:
  - Research Report > Experimental Study (1.00)
  - Research Report > New Finding (0.96)