Ophthalmology



Control Modes of Teleoperated Surgical Robotic System's Tools in Ophthalmic Surgery

Wang, Haoran, Foroutani, Yasamin, Nepo, Matthew, Rodriguez, Mercedes, Ma, Ji, Hubschman, Jean-Pierre, Tsao, Tsu-Chin, Rosen, Jacob

arXiv.org Artificial Intelligence

Abstract: The introduction of a teleoperated surgical robotic system designed for minimally invasive procedures enables the emulation of two distinct control modes through a dedicated input device on the surgical console: (1) Inside Control Mode, which emulates tool manipulation near the distal end (i.e., as if the surgeon were holding the tip of the instrument inside the patient's body), and (2) Outside Control Mode, which emulates manipulation near the proximal end (i.e., as if the surgeon were holding the tool externally). The overarching aim of this research is to study and compare surgeon performance using these two control modes, along with various scaling factors, in a simulated vitreoretinal surgical setting. The console of the Intraocular Robotic Interventional Surgical System (IRISS) was used, but the surgical robot itself and the human eye anatomy were simulated in a virtual reality (VR) environment that projected a microscope view of an intraocular setup to a VR headset. Five experienced vitreoretinal surgeons and five subjects with no surgical experience used the system to perform fundamental tool/tissue tasks common to vitreoretinal surgery: (1) touch and reset; (2) grasp and drop; (3) inject; and (4) circular tracking. The results indicate that Inside Control outperforms Outside Control across multiple tasks and performance metrics. Higher scaling factors (20 and 30) generally provided better performance, particularly in reducing trajectory errors and tissue damage. This improvement suggests that larger scaling factors enable more precise control, making them the preferred option for fine manipulation tasks. However, task completion time was not consistently reduced across all conditions, indicating that surgeons may need to balance speed against accuracy and precision depending on specific surgical requirements. By optimizing control dynamics and the user interface, robotic teleoperation has the potential to reduce complications, enhance surgical dexterity, and expand access to high-precision procedures for a broader range of practitioners. In Minimally Invasive Surgery (MIS), surgical instruments are introduced into the body through small ports established at the skin surface or, in the case of ophthalmic procedures, through specific ocular tissues such as the sclera, cornea, or conjunctiva. Unlike open surgery, where the surgeon may manipulate the tool from any position along its shaft, including proximally or distally, MIS confines the surgeon's interaction to the proximal end of the tool, which remains external to the patient's body, while the distal end performs the intervention through the fixed port.
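The motion scaling and the Inside/Outside distinction described above can be made concrete with a minimal sketch. Assuming the console reports hand-piece displacements in millimeters, a scaling factor of 20 means a 20 mm hand motion produces a 1 mm tool motion; in Outside Control, the fixed port acts as a pivot, so lateral motion of the proximal end levers the tip in the opposite direction. The function names and the small-angle rigid-body simplification are illustrative, not the IRISS implementation.

```python
import numpy as np

def scale_master_motion(delta_master_mm: np.ndarray, scale: float = 20.0) -> np.ndarray:
    """Scale a master hand-piece displacement before applying it to the tool.

    A scaling factor of 20 means a 20 mm hand motion yields a 1 mm tool motion.
    """
    return delta_master_mm / scale

def inside_control_tip_delta(delta_master_mm: np.ndarray, scale: float) -> np.ndarray:
    """Inside Control sketch: the scaled displacement moves the tip directly,
    as if the surgeon were holding the instrument tip inside the eye."""
    return scale_master_motion(delta_master_mm, scale)

def outside_control_tip_delta(delta_proximal_mm: np.ndarray,
                              depth_inside_mm: float,
                              length_outside_mm: float) -> np.ndarray:
    """Outside Control sketch (small angles): lateral proximal motion pivots
    the shaft about the port, so the tip moves in the opposite direction,
    scaled by the lever ratio of inside depth to outside length."""
    return -delta_proximal_mm * (depth_inside_mm / length_outside_mm)

# Example: a 10 mm lateral hand motion at scale 20 yields a 0.5 mm tip motion.
print(inside_control_tip_delta(np.array([10.0, 0.0, 0.0]), scale=20.0))  # [0.5 0. 0.]
# The same motion applied outside, with 15 mm of shaft inside and 30 mm outside,
# moves the tip 5 mm the other way (before any additional scaling).
print(outside_control_tip_delta(np.array([10.0, 0.0, 0.0]), 15.0, 30.0))  # [-5. 0. 0.]
```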


Fusing Structural Phenotypes with Functional Data for Early Prediction of Primary Angle Closure Glaucoma Progression

Sharma, Swati, Chuangsuwanich, Thanadet, Tan, Royston K. Y., Prasad, Shimna C., Tun, Tin A., Perera, Shamira A., Buist, Martin L., Aung, Tin, Nongpiur, Monisha E., Girard, Michaël J. A.

arXiv.org Artificial Intelligence

Purpose: To classify eyes as slow or fast glaucoma progressors in patients with primary angle closure glaucoma (PACG) using an integrated approach combining optic nerve head (ONH) structural features and sector-based visual field (VF) functional parameters. Methods: PACG patients with >5 reliable VF tests over >5 years were included. Progression was assessed in Zeiss Forum, with the baseline VF acquired within six months of OCT. Fast progression was defined as a visual field index (VFI) decline worse than -2.0% per year; slow progression as a decline better than -2.0% per year. OCT volumes were AI-segmented to extract 31 ONH parameters. The Glaucoma Hemifield Test defined five regions per hemifield, aligned with the retinal nerve fiber layer (RNFL) distribution. Mean sensitivity per region was combined with the structural parameters to train machine learning (ML) classifiers. Multiple models were tested, and SHAP identified the key predictors. Main outcome measures: Classification of slow versus fast progressors using combined structural and functional data. Results: We analyzed 451 eyes from 299 patients. Mean VFI progression was -0.92% per year; 369 eyes progressed slowly and 82 rapidly. The Random Forest model combining structural and functional features achieved the best performance (AUC = 0.87 over 2000 Monte Carlo iterations). SHAP identified six key predictors: inferior minimum rim width (MRW), inferior and inferior-temporal RNFL thickness, nasal-temporal lamina cribrosa (LC) curvature, superior nasal VF sensitivity, and inferior RNFL and GCL+IPL thickness. Models using only structural or only functional features performed worse, with AUCs of 0.82 and 0.78, respectively. Conclusions: Combining ONH structural and VF functional parameters significantly improves classification of progression risk in PACG. Inferior ONH features, MRW and RNFL thickness, were the most predictive, highlighting the critical role of ONH morphology in monitoring disease progression.
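The labeling and modeling steps described above follow a standard pattern that can be sketched briefly. The sketch below assumes a feature table with one row per eye (structural ONH parameters plus regional VF sensitivities) and a per-eye VFI slope; the column names, split ratio, and hyperparameters are illustrative, not the authors' pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def label_progression(vfi_slope_per_year: pd.Series) -> pd.Series:
    """Fast progressor = VFI declining worse than -2.0% per year."""
    return (vfi_slope_per_year < -2.0).astype(int)

def monte_carlo_auc(X: pd.DataFrame, y: pd.Series, n_iter: int = 2000) -> float:
    """Mean test AUC over repeated random stratified splits (Monte Carlo CV)."""
    aucs = []
    for seed in range(n_iter):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=seed)
        clf = RandomForestClassifier(n_estimators=300, random_state=seed)
        clf.fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
    return float(np.mean(aucs))

# SHAP attribution on a fitted model would use the shap package, e.g.:
#   explainer = shap.TreeExplainer(clf)
#   shap_values = explainer.shap_values(X_te)
# and the mean absolute SHAP value per column ranks the predictors.
```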


Performance of GPT-5 Frontier Models in Ophthalmology Question Answering

Antaki, Fares, Mikhail, David, Milad, Daniel, Mammo, Danny A, Sharma, Sumit, Srivastava, Sunil K, Chen, Bing Yu, Touma, Samir, Sevgi, Mertcan, El-Khoury, Jonathan, Keane, Pearse A, Chen, Qingyu, Tham, Yih Chung, Duval, Renaud

arXiv.org Artificial Intelligence

Importance: Novel large language models (LLMs) such as GPT-5 integrate advanced reasoning capabilities that may enhance performance on complex medical question-answering tasks. For this latest generation of reasoning models, the configurations that maximize both accuracy and cost-efficiency have yet to be established. Objective: To evaluate the performance and cost-accuracy trade-offs of OpenAI's GPT-5 compared to previous generation LLMs on ophthalmological question answering. Design, Setting, and Participants: In August 2025, 12 configurations of OpenAI's GPT-5 series (three model tiers across four reasoning effort settings) were evaluated alongside o1-high, o3-high, and GPT-4o, using 260 closed-access multiple-choice questions from the AAO Basic Clinical Science Course (BCSC) dataset. The study did not include human participants. Main Outcomes and Measures: The primary outcome was accuracy on the 260-item ophthalmology multiple-choice question set for each model configuration. Secondary outcomes included head-to-head ranking of configurations using a Bradley-Terry (BT) model applied to paired win/loss comparisons of answer accuracy, and evaluation of generated natural language rationales using a reference-anchored, pairwise LLM-as-a-judge framework. Additional analyses assessed the accuracy-cost trade-off by calculating mean per-question cost from token usage and identifying Pareto-efficient configurations. Results: The configuration GPT-5-high achieved the highest accuracy (0.965; 95% CI, 0.942-0.985),
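The Bradley-Terry ranking used for the head-to-head comparisons has a standard maximum-likelihood solution via the minorization-maximization (MM) update. A minimal sketch, where the win matrix is illustrative toy data rather than the study's results:

```python
import numpy as np

def bradley_terry(wins: np.ndarray, n_iter: int = 200, tol: float = 1e-8) -> np.ndarray:
    """MM algorithm for Bradley-Terry strengths.

    wins[i, j] = number of questions on which configuration i beat configuration j.
    Returns strengths p normalized to sum to 1; P(i beats j) = p[i] / (p[i] + p[j]).
    """
    n = wins.shape[0]
    games = wins + wins.T          # total comparisons per pair
    w = wins.sum(axis=1)           # total wins per configuration
    p = np.ones(n) / n
    for _ in range(n_iter):
        denom = np.zeros(n)
        for i in range(n):
            for j in range(n):
                if i != j and games[i, j] > 0:
                    denom[i] += games[i, j] / (p[i] + p[j])
        p_new = w / denom
        p_new /= p_new.sum()
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p

# Toy 3-configuration win matrix (not the study's data).
wins = np.array([[0, 30, 40],
                 [20, 0, 35],
                 [10, 15, 0]])
print(bradley_terry(wins))  # strengths; larger = stronger configuration
```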



A Survey of Multimodal Ophthalmic Diagnostics: From Task-Specific Approaches to Foundational Models

Luo, Xiaoling, Zheng, Ruli, Zheng, Qiaojian, Du, Zibo, Yang, Shuo, Ding, Meidan, Xu, Qihao, Liu, Chengliang, Shen, Linlin

arXiv.org Artificial Intelligence

Visual impairment represents a major global health challenge, with multimodal imaging providing complementary information that is essential for accurate ophthalmic diagnosis. This comprehensive survey systematically reviews the latest advances in multimodal deep learning methods in ophthalmology up to the year 2025. The review focuses on two main categories: task-specific multimodal approaches and large-scale multimodal foundation models. Task-specific approaches are designed for particular clinical applications such as lesion detection, disease diagnosis, and image synthesis. These methods utilize a variety of imaging modalities including color fundus photography, optical coherence tomography, and angiography. On the other hand, foundation models combine sophisticated vision-language architectures and large language models pretrained on diverse ophthalmic datasets. These models enable robust cross-modal understanding, automated clinical report generation, and decision support. The survey critically examines important datasets, evaluation metrics, and methodological innovations including self-supervised learning, attention-based fusion, and contrastive alignment. It also discusses ongoing challenges such as variability in data, limited annotations, lack of interpretability, and issues with generalizability across different patient populations. Finally, the survey outlines promising future directions that emphasize the use of ultra-widefield imaging and reinforcement learning-based reasoning frameworks to create intelligent, interpretable, and clinically applicable AI systems for ophthalmology.
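Of the methodological innovations the survey examines, contrastive alignment is the easiest to make concrete: image and text encoders are trained so that paired images (e.g., fundus photographs) and report text land close together in a shared embedding space. The sketch below is a generic CLIP-style symmetric InfoNCE loss in PyTorch, not any specific model from the survey; the embedding dimension and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb: torch.Tensor,
                               txt_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    img_emb, txt_emb: (batch, dim) outputs of the two encoders;
    matching pairs share the same batch index.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature              # (batch, batch) similarities
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> matching image
    return 0.5 * (loss_i2t + loss_t2i)

# Random embeddings standing in for encoder outputs.
loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```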


BEnchmarking LLMs for Ophthalmology (BELO) for Ophthalmological Knowledge and Reasoning

Srinivasan, Sahana, Ai, Xuguang, Lo, Thaddaeus Wai Soon, Gilson, Aidan, Zou, Minjie, Zou, Ke, Kim, Hyunjae, Yang, Mingjia, Pushpanathan, Krithi, Yew, Samantha, Loke, Wan Ting, Goh, Jocelyn, Chen, Yibing, Kong, Yiming, Fu, Emily Yuelei, Hui, Michelle Ongyong, Nwanyanwu, Kristen, Dave, Amisha, Li, Kelvin Zhenghao, Sun, Chen-Hsin, Chia, Mark, Yang, Gabriel Dawei, Wong, Wendy Meihua, Chen, David Ziyou, Liu, Dianbo, Singer, Maxwell, Antaki, Fares, Del Priore, Lucian V, Jonas, Jost, Adelman, Ron, Chen, Qingyu, Tham, Yih-Chung

arXiv.org Artificial Intelligence

Current benchmarks evaluating large language models (LLMs) in ophthalmology are limited in scope and disproportionately prioritise accuracy. We introduce BELO (BEnchmarking LLMs for Ophthalmology), a standardized and comprehensive evaluation benchmark developed through multiple rounds of expert checking by 13 ophthalmologists. BELO assesses ophthalmology-related clinical accuracy and reasoning quality. Using keyword matching and a fine-tuned PubMedBERT model, we curated ophthalmology-specific multiple-choice questions (MCQs) from diverse medical datasets (BCSC, MedMCQA, MedQA, BioASQ, and PubMedQA). Duplicate and substandard questions were systematically removed; ten ophthalmologists refined the explanation of each MCQ's correct answer, and three senior ophthalmologists further adjudicated the explanations. To illustrate BELO's utility, we evaluated six LLMs (OpenAI o1, o3-mini, GPT-4o, DeepSeek-R1, Llama-3-8B, and Gemini 1.5 Pro) using accuracy, macro-F1, and five text-generation metrics (ROUGE-L, BERTScore, BARTScore, METEOR, and AlignScore). In a further evaluation involving human experts, two ophthalmologists qualitatively reviewed 50 randomly selected outputs for accuracy, comprehensiveness, and completeness. BELO consists of 900 high-quality, expert-reviewed questions aggregated from five sources: BCSC (260), BioASQ (10), MedMCQA (572), MedQA (40), and PubMedQA (18). A public leaderboard has been established to promote transparent evaluation and reporting. Importantly, the BELO dataset will remain a hold-out, evaluation-only benchmark to ensure fair and reproducible comparisons of future models.
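The curation step pairs a cheap keyword filter with a fine-tuned PubMedBERT classifier to pull ophthalmology questions out of general medical MCQ pools. A minimal sketch of one plausible way to combine the two passes; the keyword list, the checkpoint path, and the keyword-then-classifier ordering are all assumptions, not the authors' artifacts.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

OPHTHO_KEYWORDS = {"retina", "cornea", "glaucoma", "macula", "intraocular",
                   "vitreous", "optic nerve", "cataract", "uveitis"}

# Hypothetical binary classifier fine-tuned from a PubMedBERT base;
# this checkpoint name is illustrative only.
MODEL_PATH = "your-org/pubmedbert-ophtho-filter"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)

def is_ophthalmology_mcq(question: str, threshold: float = 0.5) -> bool:
    """Keyword pass first, then a classifier pass on surviving candidates."""
    if not any(kw in question.lower() for kw in OPHTHO_KEYWORDS):
        return False
    inputs = tokenizer(question, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)
    return probs[0, 1].item() >= threshold  # index 1 = "ophthalmology" class
```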


Benchmarking Large Multimodal Models for Ophthalmic Visual Question Answering with OphthalWeChat

Xu, Pusheng, Gong, Xia, Chen, Xiaolan, Zhang, Weiyi, Yang, Jiancheng, Yan, Bingjie, Yuan, Meng, Zheng, Yalin, He, Mingguang, Shi, Danli

arXiv.org Artificial Intelligence

Purpose: To develop a bilingual multimodal visual question answering (VQA) benchmark for evaluating vision-language models (VLMs) in ophthalmology. Methods: Ophthalmic image posts and associated captions published between January 1, 2016, and December 31, 2024, were collected from WeChat Official Accounts. Based on these captions, bilingual question-answer (QA) pairs in Chinese and English were generated using GPT-4o-mini. QA pairs were categorized into six subsets by question type and language: binary (Binary_CN, Binary_EN), single-choice (Single-choice_CN, Single-choice_EN), and open-ended (Open-ended_CN, Open-ended_EN). The benchmark was used to evaluate the performance of three VLMs: GPT-4o, Gemini 2.0 Flash, and Qwen2.5-VL-72B-Instruct. Results: The final OphthalWeChat dataset included 3,469 images and 30,120 QA pairs across 9 ophthalmic subspecialties, 548 conditions, 29 imaging modalities, and 68 modality combinations. Gemini 2.0 Flash achieved the highest overall accuracy (0.548), outperforming GPT-4o (0.522, P < 0.001) and Qwen2.5-VL-72B-Instruct (0.514, P < 0.001). It also led in both the Chinese (0.546) and English (0.550) subsets. Subset-specific performance showed Gemini 2.0 Flash excelled in Binary_CN (0.687), Single-choice_CN (0.666), and Single-choice_EN (0.646), while GPT-4o ranked highest in Binary_EN (0.717), Open-ended_CN (BLEU-1: 0.301; BERTScore: 0.382), and Open-ended_EN (BLEU-1: 0.183; BERTScore: 0.240). Conclusions: This study presents the first bilingual VQA benchmark for ophthalmology, distinguished by its real-world context and inclusion of multiple examinations per patient. The dataset reflects authentic clinical decision-making scenarios and enables quantitative evaluation of VLMs, supporting the development of accurate, specialized, and trustworthy AI systems for eye care.
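The reported metrics mix exact-match accuracy (binary and single-choice subsets) with text-overlap scores for open-ended answers. A minimal sketch of both, using NLTK for BLEU-1; the tokenization (whitespace split) and toy inputs are illustrative, and BERTScore would analogously come from the bert-score package.

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

def choice_accuracy(preds: list[str], golds: list[str]) -> float:
    """Exact-match accuracy for binary / single-choice subsets."""
    return sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(preds, golds)) / len(golds)

def bleu1(pred: str, gold: str) -> float:
    """BLEU-1: unigram precision with brevity penalty, for open-ended answers."""
    smooth = SmoothingFunction().method1
    return sentence_bleu([gold.split()], pred.split(),
                         weights=(1.0, 0, 0, 0), smoothing_function=smooth)

print(choice_accuracy(["A", "B"], ["A", "C"]))                # 0.5
print(bleu1("drusen in the macula", "drusen at the macula"))  # unigram overlap
```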


RetSTA: An LLM-Based Approach for Standardizing Clinical Fundus Image Reports

Cai, Jiushen, Zhang, Weihang, Liu, Hanruo, Wang, Ningli, Li, Huiqi

arXiv.org Artificial Intelligence

Standardization of clinical reports is crucial for improving the quality of healthcare and facilitating data integration. The lack of unified standards for format, terminology, and style poses a major challenge in clinical fundus diagnostic reports and makes the data harder for large language models (LLMs) to understand. To address this, we construct a bilingual standard terminology containing fundus clinical terms and descriptions commonly used in clinical diagnosis. We then establish two models, RetSTA-7B-Zero and RetSTA-7B. RetSTA-7B-Zero, fine-tuned on an augmented dataset simulating clinical scenarios, demonstrates strong standardization behavior but covers only a limited range of diseases. To further enhance standardization performance, we build RetSTA-7B, which integrates a substantial amount of standardized data generated by RetSTA-7B-Zero together with corresponding English data, covering diverse and complex clinical scenarios and achieving report-level standardization for the first time. Experimental results demonstrate that RetSTA-7B outperforms the compared LLMs on the bilingual standardization task, validating its superior performance and generalizability.
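At inference time, report-level standardization with a fine-tuned causal LLM reduces to prompting the model with the raw report and decoding a rewritten one. A minimal sketch with the Hugging Face transformers API; the checkpoint path and prompt wording are assumptions (the paper names the model RetSTA-7B, but no hub location or prompt template is given in the abstract).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint path, for illustration only.
MODEL = "your-org/RetSTA-7B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

def standardize_report(raw_report: str) -> str:
    """Ask the fine-tuned model to rewrite a free-form fundus report
    using the standard terminology (prompt wording is illustrative)."""
    prompt = ("Rewrite the following fundus diagnostic report using "
              f"standard terminology:\n{raw_report}\nStandardized report:")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens after the prompt.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```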


A Novel Ophthalmic Benchmark for Evaluating Multimodal Large Language Models with Fundus Photographs and OCT Images

Liang, Xiaoyi, Bian, Mouxiao, Chen, Moxin, Liu, Lihao, He, Junjun, Xu, Jie, Li, Lin

arXiv.org Artificial Intelligence

In recent years, large language models (LLMs) have demonstrated remarkable potential across various medical applications. Building on this foundation, multimodal large language models (MLLMs) integrate LLMs with visual models to process diverse inputs, including clinical data and medical images. In ophthalmology, LLMs have been explored for analyzing optical coherence tomography (OCT) reports, assisting in disease classification, and even predicting treatment outcomes. However, existing MLLM benchmarks often fail to capture the complexities of real-world clinical practice, particularly in the analysis of OCT images. Many suffer from limitations such as small sample sizes, a lack of diverse OCT datasets, and insufficient expert validation. These shortcomings hinder the accurate assessment of MLLMs' ability to interpret OCT scans and their broader applicability in ophthalmology. To address these gaps, we present a benchmark dataset, curated through rigorous quality control and expert annotation, consisting of 439 fundus images and 75 OCT images. Using a standardized API-based framework, we assessed seven mainstream MLLMs and observed significant variability in diagnostic accuracy across diseases. While some models performed well in diagnosing conditions such as diabetic retinopathy and age-related macular degeneration, they struggled with others, including choroidal neovascularization and myopia, highlighting inconsistencies in performance and the need for further refinement. Our findings emphasize the importance of developing clinically relevant benchmarks to provide a more accurate assessment of MLLMs' capabilities. By refining these models and expanding their scope, we can enhance their potential to transform ophthalmic diagnosis and treatment.
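An API-based evaluation of the kind described reduces to one function that sends an image plus a diagnostic question to a vision-capable model and records the answer for scoring. A minimal sketch with the OpenAI Python client; the model name, prompt, and file path are illustrative placeholders, not the authors' exact framework.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_mllm(image_path: str, question: str, model: str = "gpt-4o") -> str:
    """Send one image + question to a vision-capable model and return its answer."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

# Example (hypothetical file and question):
# print(ask_mllm("fundus_001.jpg", "What is the most likely diagnosis?"))
```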