Ma, Ting
Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?
Bassi, Pedro R. A. S., Li, Wenxuan, Tang, Yucheng, Isensee, Fabian, Wang, Zifu, Chen, Jieneng, Chou, Yu-Cheng, Kirchhoff, Yannick, Rokuss, Maximilian, Huang, Ziyan, Ye, Jin, He, Junjun, Wald, Tassilo, Ulrich, Constantin, Baumgartner, Michael, Roy, Saikat, Maier-Hein, Klaus H., Jaeger, Paul, Ye, Yiwen, Xie, Yutong, Zhang, Jianpeng, Chen, Ziyang, Xia, Yong, Xing, Zhaohu, Zhu, Lei, Sadegheih, Yousef, Bozorgpour, Afshin, Kumari, Pratibha, Azad, Reza, Merhof, Dorit, Shi, Pengcheng, Ma, Ting, Du, Yuxin, Bai, Fan, Huang, Tiejun, Zhao, Bo, Wang, Haonan, Li, Xiaomeng, Gu, Hanxue, Dong, Haoyu, Yang, Jichen, Mazurowski, Maciej A., Gupta, Saumya, Wu, Linshan, Zhuang, Jiaxin, Chen, Hao, Roth, Holger, Xu, Daguang, Blaschko, Matthew B., Decherchi, Sergio, Cavalli, Andrea, Yuille, Alan L., Zhou, Zongwei
How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often suffer from problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs. This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals. This diverse test set enhances the statistical significance of benchmark results and rigorously evaluates AI algorithms across various out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms on three test sets. We also evaluated pre-existing AI frameworks, which, unlike individual algorithms, are more flexible and can support different algorithms, including MONAI from NVIDIA, nnU-Net from DKFZ, and numerous other open-source frameworks. We are committed to expanding this benchmark to encourage further innovation in AI algorithms for the medical domain.
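To make the evaluation protocol more concrete, here is a minimal sketch of the kind of third-party, per-organ evaluation described above: a Dice score is computed per organ for each case and then summarized separately for each out-of-distribution test set. The organ subset, data layout, and function names are illustrative assumptions, not the benchmark's actual tooling.

```python
# Hypothetical per-organ Dice evaluation over one test set (sketch only).
import numpy as np

ORGAN_LABELS = {1: "liver", 2: "spleen", 3: "pancreas"}  # assumed subset of the 9 organs


def dice_score(pred: np.ndarray, gt: np.ndarray, label: int) -> float:
    """Dice coefficient for one organ label in two integer label maps."""
    p = pred == label
    g = gt == label
    denom = p.sum() + g.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(p, g).sum() / denom


def evaluate_test_set(cases):
    """cases: iterable of (predicted_volume, ground_truth_volume) pairs for one test set."""
    per_organ = {name: [] for name in ORGAN_LABELS.values()}
    for pred, gt in cases:
        for label, name in ORGAN_LABELS.items():
            per_organ[name].append(dice_score(pred, gt, label))
    # Mean and spread per organ, reported per test set so results remain comparable
    # across in-distribution and out-of-distribution splits.
    return {name: (float(np.mean(v)), float(np.std(v))) for name, v in per_organ.items()}
```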
TopCoW: Benchmarking Topology-Aware Anatomical Segmentation of the Circle of Willis (CoW) for CTA and MRA
Yang, Kaiyuan, Musio, Fabio, Ma, Yihui, Juchler, Norman, Paetzold, Johannes C., Al-Maskari, Rami, Höher, Luciano, Li, Hongwei Bran, Hamamci, Ibrahim Ethem, Sekuboyina, Anjany, Shit, Suprosanna, Huang, Houjing, Waldmannstetter, Diana, Kofler, Florian, Navarro, Fernando, Menten, Martin, Ezhov, Ivan, Rueckert, Daniel, Vos, Iris, Ruigrok, Ynte, Velthuis, Birgitta, Kuijf, Hugo, Hämmerli, Julien, Wurster, Catherine, Bijlenga, Philippe, Westphal, Laura, Bisschop, Jeroen, Colombo, Elisa, Baazaoui, Hakim, Makmur, Andrew, Hallinan, James, Wiestler, Bene, Kirschke, Jan S., Wiest, Roland, Montagnon, Emmanuel, Letourneau-Guillon, Laurent, Galdran, Adrian, Galati, Francesco, Falcetta, Daniele, Zuluaga, Maria A., Lin, Chaolong, Zhao, Haoran, Zhang, Zehan, Ra, Sinyoung, Hwang, Jongyun, Park, Hyunjin, Chen, Junqiang, Wodzinski, Marek, Müller, Henning, Shi, Pengcheng, Liu, Wei, Ma, Ting, Yalçin, Cansu, Hamadache, Rachika E., Salvi, Joaquim, Llado, Xavier, Estrada, Uma Maria Lal-Trehan, Abramova, Valeriia, Giancardo, Luca, Oliver, Arnau, Liu, Jialu, Huang, Haibin, Cui, Yue, Lin, Zehang, Liu, Yusheng, Zhu, Shunzhi, Patel, Tatsat R., Tutino, Vincent M., Orouskhani, Maysam, Wang, Huayu, Mossa-Basha, Mahmud, Zhu, Chengcheng, Rokuss, Maximilian R., Kirchhoff, Yannick, Disch, Nico, Holzschuh, Julius, Isensee, Fabian, Maier-Hein, Klaus, Sato, Yuki, Hirsch, Sven, Wegener, Susanne, Menze, Bjoern
The Circle of Willis (CoW) is an important network of arteries connecting major circulations of the brain. Its vascular architecture is believed to affect the risk, severity, and clinical outcome of serious neuro-vascular diseases. However, characterizing the highly variable CoW anatomy is still a manual and time-consuming expert task. The CoW is usually imaged by two angiographic imaging modalities, magnetic resonance angiography (MRA) and computed tomography angiography (CTA), but there exist limited public datasets with annotations on CoW anatomy, especially for CTA. Therefore, we organized the TopCoW Challenge in 2023 with the release of an annotated CoW dataset. The TopCoW dataset was the first public dataset with voxel-level annotations for thirteen possible CoW vessel components, enabled by virtual-reality (VR) technology. It was also the first large dataset with paired MRA and CTA from the same patients. The TopCoW challenge formalized the CoW characterization problem as a multiclass anatomical segmentation task with an emphasis on topological metrics. We invited submissions worldwide for the CoW segmentation task, which attracted over 140 registered participants from four continents. The top-performing teams segmented many CoW components with Dice scores around 90%, but with lower scores for communicating arteries and rare variants. Topological mistakes also appeared in predictions with high Dice scores. Additional topological analysis revealed further areas for improvement in detecting certain CoW components and matching CoW variant topology accurately. TopCoW represented a first attempt at benchmarking the CoW anatomical segmentation task for MRA and CTA, both morphologically and topologically.
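As a rough illustration of how topology can be checked alongside overlap metrics, the sketch below verifies, for each vessel class, whether its presence agrees with the ground truth and whether the prediction forms a single connected component. This is a simplified stand-in for topology-aware analysis, not the official TopCoW metrics; the class count and connectivity choice are assumptions.

```python
# Simplified topology report for a multiclass CoW segmentation (sketch only).
import numpy as np
from scipy import ndimage


def topology_report(pred: np.ndarray, gt: np.ndarray, num_classes: int = 13):
    """pred, gt: 3D integer label maps with classes 1..num_classes."""
    report = {}
    for label in range(1, num_classes + 1):
        p = pred == label
        g = gt == label
        # Detection check: the class is present in the prediction iff it is present
        # in the ground truth (relevant for communicating arteries and rare variants).
        presence_correct = bool(p.any()) == bool(g.any())
        # Connected components of the predicted class (face connectivity by default);
        # a single vessel segment would ideally form exactly one component.
        n_components = int(ndimage.label(p)[1])
        report[label] = {
            "presence_correct": presence_correct,
            "predicted_components": n_components,
        }
    return report
```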
Uncertainty Quantification in Medical Image Segmentation with Multi-decoder U-Net
Yang, Yanwu, Guo, Xutao, Pan, Yiwei, Shi, Pengcheng, Lv, Haiyan, Ma, Ting
Accurate medical image segmentation is crucial for diagnosis and analysis. However, models without calibrated uncertainty estimates may introduce errors into downstream analysis and exhibit low robustness. Estimating the uncertainty of a measurement is vital to drawing definite, informed conclusions. In particular, it is difficult for both models and radiologists to make accurate predictions in ambiguous areas and at unclear boundaries, and even harder to reach a consensus across multiple annotations. In this work, we study the uncertainty in these areas, which carries significant information about anatomical structure and is as important as segmentation performance. We approach uncertainty quantification in medical image segmentation by measuring segmentation performance against multiple annotations in a supervised learning manner, and propose a U-Net-based architecture with multiple decoders: the image representation is produced by a shared encoder, and the segmentation corresponding to each annotation is estimated by a separate decoder. In addition, a cross loss function is proposed to bridge the gap between the different branches. The proposed architecture is trained end to end and improves predictive uncertainty estimates. With fewer parameters, the model achieves performance comparable to the integrated training model that ranked runner-up in the MICCAI-QUBIQ 2020 challenge.
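The shared-encoder, multi-decoder idea can be sketched as follows in PyTorch. The network depth, channel widths, number of decoders, and the exact form of the cross term coupling the branches are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a shared encoder with one decoder per annotator (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )


class MultiDecoderUNet(nn.Module):
    """One shared encoder; each decoder predicts the mask of one annotation."""

    def __init__(self, in_ch=1, base=16, num_decoders=3):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.decoders = nn.ModuleList(
            nn.ModuleDict({
                "up": nn.ConvTranspose2d(base * 2, base, 2, stride=2),
                "dec": conv_block(base * 2, base),
                "out": nn.Conv2d(base, 1, 1),
            })
            for _ in range(num_decoders)
        )

    def forward(self, x):
        s1 = self.enc1(x)                      # shared encoder, level 1
        s2 = self.enc2(self.pool(s1))          # shared encoder, bottleneck
        outputs = []
        for d in self.decoders:                # one branch per annotation
            u = d["up"](s2)
            u = d["dec"](torch.cat([u, s1], dim=1))
            outputs.append(d["out"](u))
        return outputs                         # list of logits, one per decoder


def multi_annotation_loss(outputs, targets, cross_weight=0.1):
    """Per-branch supervision plus a term that couples the branches.

    targets: list of binary masks, one per annotator, in the same order as outputs.
    The coupling term below penalizes disagreement between branch probabilities;
    it is one plausible reading of a cross loss, used here only for illustration.
    """
    seg = sum(F.binary_cross_entropy_with_logits(o, t) for o, t in zip(outputs, targets))
    probs = [torch.sigmoid(o) for o in outputs]
    mean_prob = torch.stack(probs).mean(dim=0)
    cross = sum(F.mse_loss(p, mean_prob) for p in probs)
    return seg + cross_weight * cross
```

At inference, the spread across the branch predictions can be read as a simple voxel-wise uncertainty map, which is one way to interpret the abstract's claim about improved predictive uncertainty estimates.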