AI-Assisted Decision-Making for Clinical Assessment of Auto-Segmented Contour Quality

Wang, Biling; Maniscalco, Austen; Bai, Ti; Wang, Siqiu; Dohopolski, Michael; Lin, Mu-Han; Shen, Chenyang; Nguyen, Dan; Huang, Junzhou; Jiang, Steve; Wang, Xinlei

arXiv.org Artificial Intelligence 

Purpose: This study introduces a novel Deep Learning (DL)-based quality assessment (QA) approach specifically designed for evaluating auto-generated contours (auto-contours) in auto-segmentation for radiotherapy, with a focus on Online Adaptive Radiotherapy (OART). The proposed method leverages Bayesian Ordinal Classification (BOC), combined with calibrated thresholds derived from uncertainty quantification, to deliver confident QA predictions. This approach addresses key challenges in clinical auto-segmentation QA workflows, such as the absence of ground truth contours, limited availability of manually labeled data, and inherent uncertainty in AI model predictions. Methods: We developed a BOC model to classify the quality of auto-contours and quantify uncertainty. To enhance predictive reliability, we implemented a calibration step to determine optimal uncertainty thresholds that meet specific clinical accuracy requirements. The method was validated under three distinct data availability scenarios: absence of manual labels, limited manual labeling, and extensive manual labeling. We specifically tested our method for auto-segmented rectum contours in prostate cancer radiotherapy. Geometric surrogate labels were employed in the absence of manual labels, transfer learning was applied when manual labels were limited, and direct use of manual labels was performed when extensive labeling was available. Results: The BOC model demonstrated robust performance across all data availability scenarios for confident predictions, with significant accuracy gains when pre-trained with surrogate labels and fine-tuned with limited manually labeled data. Specifically, fine-tuning the pretrained model with just 30 manually labeled cases and calibrating with 34 subjects achieved an accuracy of over 90% against manual labels in the test dataset.
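The calibration step described under Methods can be illustrated with a minimal sketch. The abstract does not specify the uncertainty measure or the exact calibration procedure, so everything below is an assumption for illustration: the function name `calibrate_uncertainty_threshold`, the per-case scalar uncertainty score, the toy data, and the rule itself (pick the largest uncertainty cutoff at which the retained, i.e. "confident", predictions still meet a target accuracy on a held-out calibration set). The authors' actual method may differ.

```python
import numpy as np


def calibrate_uncertainty_threshold(uncertainty, correct, target_accuracy=0.90):
    """Hypothetical helper, not taken from the paper.

    Returns the largest uncertainty threshold t such that predictions with
    uncertainty <= t reach `target_accuracy` on the calibration set, or
    None if no threshold qualifies.
    """
    uncertainty = np.asarray(uncertainty, dtype=float)
    correct = np.asarray(correct, dtype=bool)

    # Sort calibration cases from most to least confident.
    order = np.argsort(uncertainty, kind="stable")
    sorted_unc = uncertainty[order]

    # Accuracy over the k most confident cases, for k = 1..n.
    running_acc = np.cumsum(correct[order]) / np.arange(1, len(order) + 1)

    # A threshold keeps every case tied at that value, so only the last
    # position of each group of equal uncertainties is a valid cut point.
    is_cut_point = np.append(sorted_unc[1:] != sorted_unc[:-1], True)

    ok = np.flatnonzero((running_acc >= target_accuracy) & is_cut_point)
    if ok.size == 0:
        return None
    return float(sorted_unc[ok[-1]])


# Toy calibration set (fabricated for illustration only, not study data).
cal_uncertainty = [0.05, 0.10, 0.20, 0.35, 0.50, 0.80]
cal_correct = [1, 1, 1, 0, 1, 0]

threshold = calibrate_uncertainty_threshold(cal_uncertainty, cal_correct, 0.90)

# At inference time, a prediction would be treated as confident only if its
# uncertainty does not exceed the calibrated threshold.
new_uncertainty = np.array([0.08, 0.40])
is_confident = new_uncertainty <= threshold
```

In this sketch, cases whose uncertainty exceeds the threshold would be left for manual review rather than auto-accepted, which is one plausible reading of "confident QA predictions" in the abstract.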