Cataract-LMM: Large-Scale, Multi-Source, Multi-Task Benchmark for Deep Learning in Surgical Video Analysis

Ahmadi, Mohammad Javad, Gandomi, Iman, Abdi, Parisa, Mohammadi, Seyed-Farzad, Taslimi, Amirhossein, Khodaparast, Mehdi, Hashemi, Hassan, Tavakoli, Mahdi, Taghirad, Hamid D.

arXiv.org Artificial Intelligence 

The persistent gap between the growing global surgical demand and the trained surgical workforce [1] highlights the need for scalable solutions that can enhance training paradigms and optimize workflow management [2]. Computer-assisted surgery (CAS) systems are one approach to this challenge, with applications in preoperative planning [3], intraoperative guidance [4], and standardized postoperative assessment [5, 6]. The development and validation of these advanced CAS capabilities depend fundamentally on access to large-scale, deeply annotated surgical video datasets that capture procedural phases, instrument-tissue interactions, and technical skill cues [7, 8].

Phacoemulsification cataract surgery is the most common ophthalmic procedure worldwide and the primary intervention for avoidable blindness [9, 10], making it a critical domain for developing data-driven CAS with potential applications in clinical workflows and training [11, 12]. Publicly available datasets for developing CAS in cataract surgery, such as Cataract-1K [13] and CaDIS [14], are constrained by their single-center origin and limited annotation scope [15]. The absence of a multi-source dataset with comprehensive, multi-layered annotations, including objective skill assessments, has hindered the development of generalizable multi-task deep learning models [11].

To address this gap, we present the Cataract-LMM (Large-scale, Multi-source, Multi-task) Dataset: 3,000 phacoemulsification procedures recorded at two distinct clinical centers (Farabi and Noor Eye Hospitals, Tehran, Iran) between December 2021 and March 2025. The dataset is enriched with four complementary layers of annotations on subsets of the data:

1. Temporal Phase Labels (Phase): Frame-wise annotations for 13 surgical phases across 150 videos to support automated workflow recognition.
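To illustrate how frame-wise phase labels of this kind are typically consumed, the sketch below collapses a per-frame label sequence into contiguous phase segments, the representation most workflow-recognition pipelines train against. The class name, field names, and phase strings are purely illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical container for one video's frame-wise phase annotation.
# "video_id" and "frame_labels" are assumed names, not the dataset schema.
@dataclass
class PhaseAnnotation:
    video_id: str
    frame_labels: list[str]  # one phase label per frame, in temporal order

def phase_segments(frame_labels: list[str]) -> list[tuple[str, int, int]]:
    """Collapse frame-wise labels into (phase, start_frame, end_frame) runs."""
    segments = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        # Close the current run at the sequence end or on a label change.
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            segments.append((frame_labels[start], start, i - 1))
            start = i
    return segments

# Toy example with invented phase names.
ann = PhaseAnnotation("video_0001", ["Incision"] * 3 + ["Capsulorhexis"] * 2)
print(phase_segments(ann.frame_labels))
# [('Incision', 0, 2), ('Capsulorhexis', 3, 4)]
```

Segment-level views like this are convenient for computing phase durations and transition statistics, while the raw frame-wise labels remain the ground truth for per-frame classification losses.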