Can OpenAI o1 Reason Well in Ophthalmology? A 6,990-Question Head-to-Head Evaluation Study
Srinivasan, Sahana; Ai, Xuguang; Zou, Minjie; Zou, Ke; Kim, Hyunjae; Lo, Thaddaeus Wai Soon; Pushpanathan, Krithi; Kong, Yiming; Li, Anran; Singer, Maxwell; Jin, Kai; Antaki, Fares; Chen, David Ziyou; Liu, Dianbo; Adelman, Ron A.; Chen, Qingyu; Tham, Yih Chung
Question: How does OpenAI o1 perform, and how well does it reason, compared with other large language models on ophthalmology-specific questions?
Findings: This study evaluated OpenAI o1 and five other LLMs on 6,990 ophthalmology questions from MedMCQA. o1 achieved the highest accuracy (0.88) and macro-F1 score but ranked only third in reasoning capability based on text-generation metrics. Across subtopics, o1 ranked first in "Lens" and "Glaucoma" but second to GPT-4o in "Corneal and External Diseases", "Vitreous and Retina", and "Oculoplastic and Orbital Diseases". Subgroup analyses showed that o1 performed better on queries with longer ground-truth explanations.
Meaning: o1's reasoning enhancements may not fully extend to ophthalmology, underscoring the need for domain-specific refinements to optimize performance in specialized fields such as ophthalmology.
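As a minimal sketch of the headline metrics, the snippet below computes accuracy and macro-F1 over multiple-choice answers with scikit-learn; the answer labels are hypothetical placeholders, not data from the study:

```python
# Hedged sketch: accuracy and macro-F1 for MedMCQA-style multiple-choice
# evaluation. y_true and y_pred are made-up placeholder answers (A-D),
# not results from the paper.
from sklearn.metrics import accuracy_score, f1_score

y_true = ["A", "C", "B", "D", "A", "B"]  # hypothetical ground-truth answers
y_pred = ["A", "C", "B", "A", "A", "C"]  # hypothetical model predictions

accuracy = accuracy_score(y_true, y_pred)
# Macro-F1 averages per-option F1 scores, weighting each answer choice
# equally regardless of how often it occurs in the question set.
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(f"accuracy: {accuracy:.2f}, macro-F1: {macro_f1:.2f}")
```

Macro averaging is the natural choice here because it prevents frequently occurring answer options from dominating the score, which matters when option frequencies are imbalanced across 6,990 questions.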
arXiv.org Artificial Intelligence
Jan-19-2025
- Country:
  - Asia > China > Zhejiang Province (0.14)
  - North America > United States (0.28)
- Genre:
  - Research Report > Experimental Study (1.00)
  - Research Report > New Finding (0.96)