Cheplygina, Veronika
In the Picture: Medical Imaging Datasets, Artifacts, and their Living Review
Jiménez-Sánchez, Amelia, Avlona, Natalia-Rozalia, de Boer, Sarah, Campello, Víctor M., Feragen, Aasa, Ferrante, Enzo, Ganz, Melanie, Gichoya, Judy Wawira, González, Camila, Groefsema, Steff, Hering, Alessa, Hulman, Adam, Joskowicz, Leo, Juodelyte, Dovile, Kandemir, Melih, Kooi, Thijs, Lérida, Jorge del Pozo, Li, Livie Yumeng, Pacheco, Andre, Rädsch, Tim, Reyes, Mauricio, Sourget, Théo, van Ginneken, Bram, Wen, David, Weng, Nina, Xu, Jack Junchi, Zając, Hubert Dariusz, Zuluaga, Maria A., Cheplygina, Veronika
Datasets play a critical role in medical imaging research, yet issues such as label quality, shortcuts, and metadata are often overlooked. This lack of attention may harm the generalizability of algorithms and, consequently, negatively impact patient outcomes. Existing medical imaging literature reviews mostly focus on machine learning (ML) methods, and the few that focus on datasets cover only specific applications; moreover, these reviews remain static -- they are published once and not updated thereafter. This fails to account for emerging evidence, such as biases, shortcuts, and additional annotations that other researchers may contribute after a dataset is published. We refer to these newly discovered findings about datasets as research artifacts. To address this gap, we propose a living review that continuously tracks public datasets and their associated research artifacts across multiple medical imaging applications. Our approach includes a framework for the living review to monitor data documentation artifacts, and an SQL database to visualize the citation relationships between research artifacts and datasets. Lastly, we discuss key considerations for creating medical imaging datasets, review best practices for data annotation, discuss the significance of shortcuts and demographic diversity, and emphasize the importance of managing datasets throughout their entire lifecycle. Our demo is publicly available at http://130.226.140.142.
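A minimal sketch of the kind of relational schema such a living review could use to link datasets with the research artifacts that cite them; the table and column names (and the sample rows) are illustrative assumptions, not the actual schema behind the demo.

```python
# Illustrative sketch only: a tiny SQLite schema linking datasets and research
# artifacts via a citation table. Names are assumptions, not the demo's schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    modality TEXT,
    url TEXT
);
CREATE TABLE artifact (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    kind TEXT,            -- e.g. 'bias report', 'shortcut', 'extra annotations'
    year INTEGER
);
CREATE TABLE cites (      -- citation relationship: artifact -> dataset
    artifact_id INTEGER REFERENCES artifact(id),
    dataset_id INTEGER REFERENCES dataset(id)
);
""")

conn.execute("INSERT INTO dataset VALUES (1, 'Example chest X-ray set', 'X-ray', NULL)")
conn.execute("INSERT INTO artifact VALUES (1, 'Shortcut analysis (hypothetical)', 'shortcut', 2024)")
conn.execute("INSERT INTO cites VALUES (1, 1)")

# All artifacts documented for a given dataset
rows = conn.execute("""
    SELECT a.title, a.kind, a.year
    FROM artifact a JOIN cites c ON a.id = c.artifact_id
    WHERE c.dataset_id = 1
""").fetchall()
print(rows)
```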
Confidence intervals uncovered: Are we ready for real-world medical imaging AI?
Christodoulou, Evangelia, Reinke, Annika, Houhou, Rola, Kalinowski, Piotr, Erkan, Selen, Sudre, Carole H., Burgos, Ninon, Boutaj, Sofiène, Loizillon, Sophie, Solal, Maëlys, Rieke, Nicola, Cheplygina, Veronika, Antonelli, Michela, Mayer, Leon D., Tizabi, Minu D., Cardoso, M. Jorge, Simpson, Amber, Jäger, Paul F., Kopp-Schneider, Annette, Varoquaux, Gaël, Colliot, Olivier, Maier-Hein, Lena
Medical imaging is spearheading the AI transformation of healthcare. Performance reporting is key to determining which methods should be translated into clinical practice. Frequently, broad conclusions are simply derived from mean performance values. In this paper, we argue that this common practice is often a misleading simplification, as it ignores performance variability. Our contribution is threefold. (1) Analyzing all MICCAI segmentation papers (n = 221) published in 2023, we first observe that more than 50% of papers do not assess performance variability at all. Moreover, only one (0.5%) paper reported confidence intervals (CIs) for model performance. (2) To address the reporting bottleneck, we show that the unreported standard deviation (SD) in segmentation papers can be approximated by a second-order polynomial function of the mean Dice similarity coefficient (DSC). Based on external validation data from 56 previous MICCAI challenges, we demonstrate that this approximation can accurately reconstruct the CI of a method using information provided in publications. (3) Finally, we reconstructed 95% CIs around the mean DSC of MICCAI 2023 segmentation papers. The median CI width was 0.03, which is three times larger than the median performance gap between the first- and second-ranked methods. For more than 60% of papers, the mean performance of the second-ranked method was within the CI of the first-ranked method. We conclude that current publications typically do not provide sufficient evidence to support which models could potentially be translated into clinical practice.
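A hedged sketch of the CI-reconstruction idea described above: approximate the unreported SD as a second-order polynomial of the mean DSC, then form a normal-approximation 95% CI from the number of test cases. The polynomial coefficients below are placeholders, not the fitted values reported in the paper.

```python
# Sketch: reconstruct a 95% CI around a reported mean DSC using an assumed
# quadratic SD approximation. Coefficients are illustrative placeholders.
import numpy as np

def approx_sd(mean_dsc, coeffs=(-0.4, 0.4, 0.05)):
    """Assumed quadratic fit sd ~ a*m^2 + b*m + c (placeholder coefficients)."""
    a, b, c = coeffs
    return max(a * mean_dsc**2 + b * mean_dsc + c, 0.0)

def reconstruct_ci(mean_dsc, n_cases, z=1.96):
    """Normal-approximation 95% CI around the mean DSC."""
    half_width = z * approx_sd(mean_dsc) / np.sqrt(n_cases)
    return mean_dsc - half_width, mean_dsc + half_width

print(reconstruct_ci(mean_dsc=0.85, n_cases=50))
```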
Source Matters: Source Dataset Impact on Model Robustness in Medical Imaging
Juodelyte, Dovile, Lu, Yucheng, Jiménez-Sánchez, Amelia, Bottazzi, Sabrina, Ferrante, Enzo, Cheplygina, Veronika
Transfer learning has become an essential part of medical imaging classification algorithms, often leveraging ImageNet weights. However, the domain shift from natural to medical images has prompted alternatives such as RadImageNet, which often demonstrates comparable classification performance. Yet it remains unclear whether the performance gains from transfer learning stem from improved generalization or from shortcut learning. To address this, we investigate potential confounders -- whether synthetic or sampled from the data -- across two publicly available chest X-ray and CT datasets. We show that ImageNet and RadImageNet achieve comparable classification performance, yet ImageNet is much more prone to overfitting to confounders. We recommend that researchers using ImageNet-pretrained models reexamine their model robustness by conducting similar experiments. Our code and experiments are available at https://github.com/DovileDo/source-matters.
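One simple way to probe for shortcut learning with a synthetic confounder is sketched below: a small bright marker co-occurs with the positive class at training time and with the negative class at test time, so a model relying on the marker collapses on the flipped split. This follows the spirit of the experiments, not the authors' exact protocol (see the linked repository for that).

```python
# Sketch: inject a synthetic confounder correlated with the label, then flip the
# correlation at test time to measure how much a model relies on it.
import numpy as np

def add_confounder(image, size=8, value=1.0):
    """Stamp a bright square in the top-left corner of a (H, W) or (C, H, W) image."""
    img = image.copy()
    img[..., :size, :size] = value
    return img

def build_confounded_split(images, labels, flip=False):
    """Attach the marker to label 1 (train) or to label 0 when flip=True (test)."""
    target = 0 if flip else 1
    stamped = [add_confounder(x) if y == target else x for x, y in zip(images, labels)]
    return np.stack(stamped), labels

rng = np.random.default_rng(0)
imgs = rng.random((4, 1, 32, 32)).astype(np.float32)   # dummy images
lbls = np.array([0, 1, 0, 1])
x_train, _ = build_confounded_split(imgs, lbls)              # marker co-occurs with label 1
x_test, _ = build_confounded_split(imgs, lbls, flip=True)    # marker co-occurs with label 0
```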
Revisiting Hidden Representations in Transfer Learning for Medical Imaging
Juodelyte, Dovile, Jiménez-Sánchez, Amelia, Cheplygina, Veronika
While the availability of massive amounts of training data is a key component of the success of deep learning, medical image datasets are often limited in diversity and size. Transfer learning has the potential to bridge the gap between related yet different domains. For medical applications, however, it remains unclear whether it is more beneficial to pre-train on natural or medical images. We aim to shed light on this problem by comparing initialization on ImageNet and RadImageNet on seven medical classification tasks. Our work includes a replication study, which yields results contrary to previously published findings. In our experiments, ResNet50 models pre-trained on ImageNet tend to outperform those trained on RadImageNet. To gain further insights, we investigate the learned representations using Canonical Correlation Analysis (CCA) and compare the predictions of the different models. Our results indicate that, contrary to intuition, ImageNet and RadImageNet may converge to distinct intermediate representations, which appear to diverge further during fine-tuning. Despite these distinct representations, the predictions of the models remain similar. Our findings show that the similarity between networks before and after fine-tuning does not correlate with performance gains, suggesting that the advantages of transfer learning might not solely originate from the reuse of features in the early layers of a convolutional neural network.
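A minimal sketch of how CCA can be used to compare hidden representations of two networks, in the spirit of the analysis above; the activation matrices, layer choice, and dimensions are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: CCA-based similarity between two activation matrices extracted from the
# same layer of two differently pre-trained networks on the same images.
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_similarity(acts_a, acts_b, n_components=10):
    """Mean correlation of the top CCA components between two (n_examples, n_features) matrices."""
    cca = CCA(n_components=n_components, max_iter=1000)
    u, v = cca.fit_transform(acts_a, acts_b)
    corrs = [np.corrcoef(u[:, i], v[:, i])[0, 1] for i in range(n_components)]
    return float(np.mean(corrs))

rng = np.random.default_rng(0)
feats_imagenet = rng.standard_normal((200, 64))      # placeholder activations
feats_radimagenet = rng.standard_normal((200, 64))   # placeholder activations
print(cca_similarity(feats_imagenet, feats_radimagenet))
```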
Biomedical image analysis competitions: The state of current participation practice
Eisenmann, Matthias, Reinke, Annika, Weru, Vivienn, Tizabi, Minu Dietlinde, Isensee, Fabian, Adler, Tim J., Godau, Patrick, Cheplygina, Veronika, Kozubek, Michal, Ali, Sharib, Gupta, Anubha, Kybic, Jan, Noble, Alison, de Solórzano, Carlos Ortiz, Pachade, Samiksha, Petitjean, Caroline, Sage, Daniel, Wei, Donglai, Wilden, Elizabeth, Alapatt, Deepak, Andrearczyk, Vincent, Baid, Ujjwal, Bakas, Spyridon, Balu, Niranjan, Bano, Sophia, Bawa, Vivek Singh, Bernal, Jorge, Bodenstedt, Sebastian, Casella, Alessandro, Choi, Jinwook, Commowick, Olivier, Daum, Marie, Depeursinge, Adrien, Dorent, Reuben, Egger, Jan, Eichhorn, Hannah, Engelhardt, Sandy, Ganz, Melanie, Girard, Gabriel, Hansen, Lasse, Heinrich, Mattias, Heller, Nicholas, Hering, Alessa, Huaulmé, Arnaud, Kim, Hyunjeong, Landman, Bennett, Li, Hongwei Bran, Li, Jianning, Ma, Jun, Martel, Anne, Martín-Isla, Carlos, Menze, Bjoern, Nwoye, Chinedu Innocent, Oreiller, Valentin, Padoy, Nicolas, Pati, Sarthak, Payette, Kelly, Sudre, Carole, van Wijnen, Kimberlin, Vardazaryan, Armine, Vercauteren, Tom, Wagner, Martin, Wang, Chuanbo, Yap, Moi Hoon, Yu, Zeyun, Yuan, Chun, Zenk, Maximilian, Zia, Aneeq, Zimmerer, David, Bao, Rina, Choi, Chanyeol, Cohen, Andrew, Dzyubachyk, Oleh, Galdran, Adrian, Gan, Tianyuan, Guo, Tianqi, Gupta, Pradyumna, Haithami, Mahmood, Ho, Edward, Jang, Ikbeom, Li, Zhili, Luo, Zhengbo, Lux, Filip, Makrogiannis, Sokratis, Müller, Dominik, Oh, Young-tack, Pang, Subeen, Pape, Constantin, Polat, Gorkem, Reed, Charlotte Rosalie, Ryu, Kanghyun, Scherr, Tim, Thambawita, Vajira, Wang, Haoyu, Wang, Xinliang, Xu, Kele, Yeh, Hung, Yeo, Doyeob, Yuan, Yixuan, Zeng, Yan, Zhao, Xin, Abbing, Julian, Adam, Jannes, Adluru, Nagesh, Agethen, Niklas, Ahmed, Salman, Khalil, Yasmina Al, Alenyà, Mireia, Alhoniemi, Esa, An, Chengyang, Anwar, Talha, Arega, Tewodros Weldebirhan, Avisdris, Netanell, Aydogan, Dogu Baran, Bai, Yingbin, Calisto, Maria Baldeon, Basaran, Berke Doga, Beetz, Marcel, Bian, Cheng, Bian, Hao, Blansit, Kevin, Bloch, Louise, Bohnsack, Robert, Bosticardo, Sara, Breen, Jack, Brudfors, Mikael, Brüngel, Raphael, Cabezas, Mariano, Cacciola, Alberto, Chen, Zhiwei, Chen, Yucong, Chen, Daniel Tianming, Cho, Minjeong, Choi, Min-Kook, Xie, Chuantao Xie Chuantao, Cobzas, Dana, Cohen-Adad, Julien, Acero, Jorge Corral, Das, Sujit Kumar, de Oliveira, Marcela, Deng, Hanqiu, Dong, Guiming, Doorenbos, Lars, Efird, Cory, Escalera, Sergio, Fan, Di, Serj, Mehdi Fatan, Fenneteau, Alexandre, Fidon, Lucas, Filipiak, Patryk, Finzel, René, Freitas, Nuno R., Friedrich, Christoph M., Fulton, Mitchell, Gaida, Finn, Galati, Francesco, Galazis, Christoforos, Gan, Chang Hee, Gao, Zheyao, Gao, Shengbo, Gazda, Matej, Gerats, Beerend, Getty, Neil, Gibicar, Adam, Gifford, Ryan, Gohil, Sajan, Grammatikopoulou, Maria, Grzech, Daniel, Güley, Orhun, Günnemann, Timo, Guo, Chunxu, Guy, Sylvain, Ha, Heonjin, Han, Luyi, Han, Il Song, Hatamizadeh, Ali, He, Tian, Heo, Jimin, Hitziger, Sebastian, Hong, SeulGi, Hong, SeungBum, Huang, Rian, Huang, Ziyan, Huellebrand, Markus, Huschauer, Stephan, Hussain, Mustaffa, Inubushi, Tomoo, Polat, Ece Isik, Jafaritadi, Mojtaba, Jeong, SeongHun, Jian, Bailiang, Jiang, Yuanhong, Jiang, Zhifan, Jin, Yueming, Joshi, Smriti, Kadkhodamohammadi, Abdolrahim, Kamraoui, Reda Abdellah, Kang, Inha, Kang, Junghwa, Karimi, Davood, Khademi, April, Khan, Muhammad Irfan, Khan, Suleiman A., Khantwal, Rishab, Kim, Kwang-Ju, Kline, Timothy, Kondo, Satoshi, Kontio, Elina, Krenzer, Adrian, Kroviakov, Artem, Kuijf, Hugo, Kumar, Satyadwyoom, La Rosa, Francesco, Lad, 
Abhi, Lee, Doohee, Lee, Minho, Lena, Chiara, Li, Hao, Li, Ling, Li, Xingyu, Liao, Fuyuan, Liao, KuanLun, Oliveira, Arlindo Limede, Lin, Chaonan, Lin, Shan, Linardos, Akis, Linguraru, Marius George, Liu, Han, Liu, Tao, Liu, Di, Liu, Yanling, Lourenço-Silva, João, Lu, Jingpei, Lu, Jiangshan, Luengo, Imanol, Lund, Christina B., Luu, Huan Minh, Lv, Yi, Lv, Yi, Macar, Uzay, Maechler, Leon, L., Sina Mansour, Marshall, Kenji, Mazher, Moona, McKinley, Richard, Medela, Alfonso, Meissen, Felix, Meng, Mingyuan, Miller, Dylan, Mirjahanmardi, Seyed Hossein, Mishra, Arnab, Mitha, Samir, Mohy-ud-Din, Hassan, Mok, Tony Chi Wing, Murugesan, Gowtham Krishnan, Karthik, Enamundram Naga, Nalawade, Sahil, Nalepa, Jakub, Naser, Mohamed, Nateghi, Ramin, Naveed, Hammad, Nguyen, Quang-Minh, Quoc, Cuong Nguyen, Nichyporuk, Brennan, Oliveira, Bruno, Owen, David, Pal, Jimut Bahan, Pan, Junwen, Pan, Wentao, Pang, Winnie, Park, Bogyu, Pawar, Vivek, Pawar, Kamlesh, Peven, Michael, Philipp, Lena, Pieciak, Tomasz, Plotka, Szymon, Plutat, Marcel, Pourakpour, Fattaneh, Preložnik, Domen, Punithakumar, Kumaradevan, Qayyum, Abdul, Queirós, Sandro, Rahmim, Arman, Razavi, Salar, Ren, Jintao, Rezaei, Mina, Rico, Jonathan Adam, Rieu, ZunHyan, Rink, Markus, Roth, Johannes, Ruiz-Gonzalez, Yusely, Saeed, Numan, Saha, Anindo, Salem, Mostafa, Sanchez-Matilla, Ricardo, Schilling, Kurt, Shao, Wei, Shen, Zhiqiang, Shi, Ruize, Shi, Pengcheng, Sobotka, Daniel, Soulier, Théodore, Fadida, Bella Specktor, Stoyanov, Danail, Mun, Timothy Sum Hon, Sun, Xiaowu, Tao, Rong, Thaler, Franz, Théberge, Antoine, Thielke, Felix, Torres, Helena, Wahid, Kareem A., Wang, Jiacheng, Wang, YiFei, Wang, Wei, Wang, Xiong, Wen, Jianhui, Wen, Ning, Wodzinski, Marek, Wu, Ye, Xia, Fangfang, Xiang, Tianqi, Xiaofei, Chen, Xu, Lizhan, Xue, Tingting, Yang, Yuxuan, Yang, Lin, Yao, Kai, Yao, Huifeng, Yazdani, Amirsaeed, Yip, Michael, Yoo, Hwanseung, Yousefirizi, Fereshteh, Yu, Shunkai, Yu, Lei, Zamora, Jonathan, Zeineldin, Ramy Ashraf, Zeng, Dewen, Zhang, Jianpeng, Zhang, Bokai, Zhang, Jiapeng, Zhang, Fan, Zhang, Huahong, Zhao, Zhongchen, Zhao, Zixuan, Zhao, Jiachen, Zhao, Can, Zheng, Qingshuo, Zhi, Yuheng, Zhou, Ziqi, Zou, Baosheng, Maier-Hein, Klaus, Jäger, Paul F., Kopp-Schneider, Annette, Maier-Hein, Lena
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on either multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
Why is the winner the best?
Eisenmann, Matthias, Reinke, Annika, Weru, Vivienn, Tizabi, Minu Dietlinde, Isensee, Fabian, Adler, Tim J., Ali, Sharib, Andrearczyk, Vincent, Aubreville, Marc, Baid, Ujjwal, Bakas, Spyridon, Balu, Niranjan, Bano, Sophia, Bernal, Jorge, Bodenstedt, Sebastian, Casella, Alessandro, Cheplygina, Veronika, Daum, Marie, de Bruijne, Marleen, Depeursinge, Adrien, Dorent, Reuben, Egger, Jan, Ellis, David G., Engelhardt, Sandy, Ganz, Melanie, Ghatwary, Noha, Girard, Gabriel, Godau, Patrick, Gupta, Anubha, Hansen, Lasse, Harada, Kanako, Heinrich, Mattias, Heller, Nicholas, Hering, Alessa, Huaulmé, Arnaud, Jannin, Pierre, Kavur, Ali Emre, Kodym, Oldřich, Kozubek, Michal, Li, Jianning, Li, Hongwei, Ma, Jun, Martín-Isla, Carlos, Menze, Bjoern, Noble, Alison, Oreiller, Valentin, Padoy, Nicolas, Pati, Sarthak, Payette, Kelly, Rädsch, Tim, Rafael-Patiño, Jonathan, Bawa, Vivek Singh, Speidel, Stefanie, Sudre, Carole H., van Wijnen, Kimberlin, Wagner, Martin, Wei, Donglai, Yamlahi, Amine, Yap, Moi Hoon, Yuan, Chun, Zenk, Maximilian, Zia, Aneeq, Zimmerer, David, Aydogan, Dogu Baran, Bhattarai, Binod, Bloch, Louise, Brüngel, Raphael, Cho, Jihoon, Choi, Chanyeol, Dou, Qi, Ezhov, Ivan, Friedrich, Christoph M., Fuller, Clifton, Gaire, Rebati Raman, Galdran, Adrian, Faura, Álvaro García, Grammatikopoulou, Maria, Hong, SeulGi, Jahanifar, Mostafa, Jang, Ikbeom, Kadkhodamohammadi, Abdolrahim, Kang, Inha, Kofler, Florian, Kondo, Satoshi, Kuijf, Hugo, Li, Mingxing, Luu, Minh Huan, Martinčič, Tomaž, Morais, Pedro, Naser, Mohamed A., Oliveira, Bruno, Owen, David, Pang, Subeen, Park, Jinah, Park, Sung-Hong, Płotka, Szymon, Puybareau, Elodie, Rajpoot, Nasir, Ryu, Kanghyun, Saeed, Numan, Shephard, Adam, Shi, Pengcheng, Štepec, Dejan, Subedi, Ronast, Tochon, Guillaume, Torres, Helena R., Urien, Helene, Vilaça, João L., Wahid, Kareem Abdul, Wang, Haojie, Wang, Jiacheng, Wang, Liansheng, Wang, Xiyue, Wiestler, Benedikt, Wodzinski, Marek, Xia, Fangfang, Xie, Juanying, Xiong, Zhiwei, Yang, Sen, Yang, Yanwu, Zhao, Zixuan, Maier-Hein, Klaus, Jäger, Paul F., Kopp-Schneider, Annette, Maier-Hein, Lena
International benchmarking competitions have become fundamental for the comparative performance assessment of image analysis methods. However, little attention has been given to investigating what can be learnt from these competitions. Do they really generate scientific progress? What are common and successful participation strategies? What makes a solution superior to a competing method? To address this gap in the literature, we performed a multi-center study with all 80 competitions that were conducted in the scope of IEEE ISBI 2021 and MICCAI 2021. Statistical analyses performed based on comprehensive descriptions of the submitted algorithms linked to their rank as well as the underlying participation strategies revealed common characteristics of winning solutions. These typically include the use of multi-task learning (63%) and/or multi-stage pipelines (61%), and a focus on augmentation (100%), image preprocessing (97%), data curation (79%), and postprocessing (66%). The "typical" lead of a winning team is a computer scientist with a doctoral degree, five years of experience in biomedical image analysis, and four years of experience in deep learning. Two core general development strategies stood out for highly-ranked teams: the reflection of the metrics in the method design and the focus on analyzing and handling failure cases. According to the organizers, 43% of the winning algorithms exceeded the state of the art but only 11% completely solved the respective domain problem. The insights of our study could help researchers (1) improve algorithm development strategies when approaching new problems, and (2) focus on open research questions revealed by this work.
Effect of Prior-based Losses on Segmentation Performance: A Benchmark
Jurdi, Rosana El, Petitjean, Caroline, Cheplygina, Veronika, Honeine, Paul, Abdallah, Fahed
Today, deep convolutional neural networks (CNNs) have demonstrated state-of-the-art performance for medical image segmentation across various imaging modalities and tasks. Despite this success, segmentation networks may still generate anatomically aberrant segmentations, with holes or inaccuracies near the object boundaries. To enforce anatomical plausibility, recent research studies have focused on incorporating prior knowledge, such as object shape or boundary, as constraints in the loss function. The integrated prior can be low-level, referring to reformulated representations extracted from the ground-truth segmentations, or high-level, representing external medical information such as the organ's shape or size. Over the past few years, prior-based losses have attracted rising interest in the research field, since they allow integration of expert knowledge while still being architecture-agnostic. However, given the diversity of prior-based losses across different medical imaging challenges and tasks, it has become hard to identify which loss works best for which dataset. In this paper, we establish a benchmark of recent prior-based losses for medical image segmentation. The main objective is to provide intuition on which losses to choose for a particular task or dataset. To this end, four low-level and high-level prior-based losses are selected. The considered losses are validated on 8 different datasets from a variety of medical image segmentation challenges, including the Decathlon, ISLES and WMH challenges. Results show that, whereas low-level prior-based losses can guarantee an increase in performance over the Dice loss baseline regardless of dataset characteristics, high-level prior-based losses can increase anatomical plausibility, depending on the data characteristics.
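As an illustration of the general idea of a prior-based loss, the sketch below combines a soft Dice loss with a simple size prior that penalizes deviations between predicted and ground-truth object size; this is a generic example, not one of the specific losses benchmarked in the paper, and the weighting is a placeholder.

```python
# Sketch: Dice loss plus a simple low-level (size) prior term for binary segmentation.
import torch

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for probabilities `pred` and binary masks `target`, shape (B, H, W)."""
    inter = (pred * target).sum(dim=(1, 2))
    denom = pred.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    return 1 - ((2 * inter + eps) / (denom + eps)).mean()

def size_prior_loss(pred, target):
    """Penalize the difference between predicted and ground-truth object size (pixel fraction)."""
    return ((pred.mean(dim=(1, 2)) - target.mean(dim=(1, 2))) ** 2).mean()

def combined_loss(pred, target, prior_weight=0.1):
    return dice_loss(pred, target) + prior_weight * size_prior_loss(pred, target)

pred = torch.sigmoid(torch.randn(2, 64, 64))     # dummy predicted probabilities
target = (torch.rand(2, 64, 64) > 0.7).float()   # dummy ground-truth masks
print(combined_loss(pred, target).item())
```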
How I failed machine learning in medical imaging -- shortcomings and recommendations
Varoquaux, Gaël, Cheplygina, Veronika
Medical imaging is an important research field with many opportunities for improving patients' health. However, there are a number of challenges that are slowing down the progress of the field as a whole, such as optimizing for publication. In this paper we review several problems related to choosing datasets, methods, evaluation metrics, and publication strategies. With a review of the literature and our own analysis, we show that at every step, potential biases can creep in. On a positive note, we also see that initiatives to counteract these problems are already being started. Finally, we provide a broad range of recommendations on how to further address these problems in the future. For reproducibility, data and code for our analyses are available at \url{https://github.com/GaelVaroquaux/ml_med_imaging_failures}
Risk of Training Diagnostic Algorithms on Data with Demographic Bias
Abbasi-Sureshjani, Samaneh, Raumanns, Ralf, Michels, Britt E. J., Schouten, Gerard, Cheplygina, Veronika
One of the critical challenges in machine learning applications is to have fair predictions. There are numerous recent examples in various domains that convincingly show that algorithms trained with biased datasets can easily lead to erroneous or discriminatory conclusions. This is even more crucial in clinical applications, where the predictive algorithms are designed mainly based on a limited or given set of medical images, while demographic variables such as age, sex and race are not taken into account. In this work, we conduct a survey of the MICCAI 2018 proceedings to investigate the common practice in medical image analysis applications. Surprisingly, we found that papers focusing on diagnosis rarely describe the demographics of the datasets used, and the diagnosis is purely based on images. In order to highlight the importance of considering the demographics in diagnosis tasks, we used a publicly available dataset of skin lesions. We then demonstrate that a classifier with an overall area under the curve (AUC) of 0.83 has variable performance between 0.76 and 0.91 on subgroups based on age and sex, even though the training set was relatively balanced. Moreover, we show that it is possible to learn unbiased features by explicitly using demographic variables in an adversarial training setup, which leads to balanced scores across subgroups. Finally, we discuss the implications of these results and provide recommendations for further research.
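The subgroup evaluation described above amounts to computing the AUC separately per demographic subgroup and comparing it to the overall AUC. The sketch below shows that computation on synthetic data; the grouping labels and scores are illustrative, not the paper's dataset.

```python
# Sketch: overall AUC vs. AUC per demographic subgroup, on synthetic data.
import numpy as np
from sklearn.metrics import roc_auc_score

def subgroup_aucs(y_true, y_score, groups):
    """AUC computed for the full set and separately for each subgroup (e.g. age bin x sex)."""
    out = {"overall": roc_auc_score(y_true, y_score)}
    for g in np.unique(groups):
        mask = groups == g
        if len(np.unique(y_true[mask])) == 2:   # AUC is undefined for single-class subgroups
            out[str(g)] = roc_auc_score(y_true[mask], y_score[mask])
    return out

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 400)
scores = np.clip(y * 0.3 + rng.random(400) * 0.7, 0, 1)   # synthetic classifier scores
groups = rng.choice(["female<60", "female>=60", "male<60", "male>=60"], 400)
print(subgroup_aucs(y, scores, groups))
```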
Cats or CAT scans: transfer learning from natural or medical image source datasets?
Cheplygina, Veronika
Transfer learning is a widely used strategy in medical image analysis. Instead of only training a network with a limited amount of data from the target task of interest, we can first train the network with other, potentially larger source datasets, creating a more robust model. The source datasets do not have to be related to the target task. For a classification task in lung CT images, we could use either head CT images or images of cats as the source. While head CT images appear more similar to lung CT images, the number and diversity of cat images might lead to a better model overall. In this survey we review a number of papers that have performed similar comparisons. Although the answer to which strategy is best seems to be "it depends", we discuss a number of research directions we need to take as a community, to gain more understanding of this topic.
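A minimal sketch of the transfer-learning setup discussed above: start from a backbone pre-trained on a source dataset (ImageNet weights here, via torchvision) and fine-tune it on a target task. The target task, head size, and hyperparameters are placeholders.

```python
# Sketch: fine-tuning a source-pretrained backbone on a (hypothetical) binary target task.
import torch
import torch.nn as nn
from torchvision import models

# Source: natural images (ImageNet). Swapping in a medical source would change only the weights.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)   # new head for a binary target task (e.g. lung CT)

# Option A: fine-tune everything; Option B: freeze the backbone and train only the head.
for p in model.parameters():
    p.requires_grad = True

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
# for images, labels in target_loader:   # target-task data loader (not shown)
#     optimizer.zero_grad()
#     loss = criterion(model(images), labels)
#     loss.backward()
#     optimizer.step()
```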