Foundation Models -- A Panacea for Artificial Intelligence in Pathology?
Mulliqi, Nita, Blilie, Anders, Ji, Xiaoyi, Szolnoky, Kelvin, Olsson, Henrik, Boman, Sol Erika, Titus, Matteo, Gonzalez, Geraldine Martinez, Mielcarz, Julia Anna, Valkonen, Masi, Gudlaugsson, Einar, Kjosavik, Svein R., Asenjo, José, Gambacorta, Marcello, Libretti, Paolo, Braun, Marcin, Kordek, Radzislaw, Łowicki, Roman, Hotakainen, Kristina, Väre, Päivi, Pedersen, Bodil Ginnerup, Sørensen, Karina Dalsgaard, Ulhøi, Benedicte Parm, Ruusuvuori, Pekka, Delahunt, Brett, Samaratunga, Hemamali, Tsuzuki, Toyonori, Janssen, Emilius A. M., Egevad, Lars, Eklund, Martin, Kartasalo, Kimmo
The role of artificial intelligence (AI) in pathology has evolved from aiding diagnostics to uncovering predictive morphological patterns in whole slide images (WSIs). Recently, foundation models (FMs) leveraging self-supervised pre-training have been widely advocated as a universal solution for diverse downstream tasks. However, open questions remain about their clinical applicability and generalization advantages over end-to-end learning using task-specific (TS) models. Here, we focused on AI with clinical-grade performance for prostate cancer diagnosis and Gleason grading. We present the largest validation of AI for this task, using over 100,000 core needle biopsies from 7,342 patients across 15 sites in 11 countries. We compared two FMs with a fully end-to-end TS model in a multiple instance learning framework. Our findings challenge assumptions that FMs universally outperform TS models. While FMs demonstrated utility in data-scarce scenarios, their performance converged with - and was in some cases surpassed by - TS models when sufficient labeled training data were available. Notably, extensive task-specific training markedly reduced clinically significant misgrading, misdiagnosis of challenging morphologies, and variability across different WSI scanners. Additionally, FMs used up to 35 times more energy than the TS model, raising concerns about their sustainability. Our results underscore that while FMs offer clear advantages for rapid prototyping and research, their role as a universal solution for clinically applicable medical AI remains uncertain. For high-stakes clinical applications, rigorous validation and consideration of task-specific training remain critically important. We advocate for integrating the strengths of FMs and end-to-end learning to achieve robust and resource-efficient AI pathology solutions fit for clinical use.
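The abstract describes comparing foundation-model embeddings against an end-to-end task-specific model within a multiple instance learning (MIL) framework. Purely as an informal illustration, the sketch below shows one common form of attention-based MIL aggregation over patch embeddings; the class names, dimensions and architecture are assumptions, not the study's actual implementation.

import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, embed_dim=768, hidden_dim=128, n_classes=6):
        super().__init__()
        # One attention score per patch embedding.
        self.attention = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        self.classifier = nn.Linear(embed_dim, n_classes)

    def forward(self, patch_embeddings):
        # patch_embeddings: (num_patches, embed_dim) for one slide (the "bag").
        weights = torch.softmax(self.attention(patch_embeddings), dim=0)  # (N, 1)
        slide_embedding = (weights * patch_embeddings).sum(dim=0)         # (embed_dim,)
        return self.classifier(slide_embedding)                           # slide-level logits

# Example: a bag of 500 patch embeddings from a 768-dimensional encoder,
# e.g. frozen foundation-model features or features learned end-to-end.
logits = AttentionMIL()(torch.randn(500, 768))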
Deformation equivariant cross-modality image synthesis with paired non-aligned training data
Honkamaa, Joel, Khan, Umair, Koivukoski, Sonja, Valkonen, Mira, Latonen, Leena, Ruusuvuori, Pekka, Marttinen, Pekka
Cross-modality image synthesis is an active research topic with multiple clinically relevant medical applications. Recently, methods allowing training with paired but misaligned data have started to emerge. However, no robust and well-performing methods applicable to a wide range of real-world data sets exist. In this work, we propose a generic solution to the problem of cross-modality image synthesis with paired but non-aligned data by introducing new loss functions that encourage deformation equivariance. The method consists of jointly training an image synthesis network together with separate registration networks and allows adversarial training conditioned on the input even with misaligned data. The work lowers the barrier to new clinical applications by allowing straightforward training of cross-modality image synthesis networks on more difficult data sets.
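The loss functions themselves are not detailed in the abstract, but the central idea, encouraging the synthesis network to commute with spatial deformations, can be written compactly. The following is a minimal PyTorch sketch of such a penalty under assumed tensor shapes; it is not the paper's actual loss formulation.

import torch
import torch.nn.functional as F

def warp(image, flow):
    # image: (B, C, H, W); flow: (B, H, W, 2) displacement in normalized [-1, 1] coordinates.
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    identity = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2)
    return F.grid_sample(image, identity + flow, align_corners=True)

def deformation_equivariance_loss(generator, x, flow):
    # Penalize the discrepancy between synthesize-then-warp and warp-then-synthesize;
    # a perfectly deformation-equivariant generator makes this term zero.
    return F.l1_loss(warp(generator(x), flow), generator(warp(x, flow)))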
Physical Color Calibration of Digital Pathology Scanners for Robust Artificial Intelligence Assisted Cancer Diagnosis
Ji, Xiaoyi, Salmon, Richard, Mulliqi, Nita, Khan, Umair, Wang, Yinxi, Blilie, Anders, Olsson, Henrik, Pedersen, Bodil Ginnerup, Sørensen, Karina Dalsgaard, Ulhøi, Benedicte Parm, Kjosavik, Svein R, Janssen, Emilius AM, Rantalainen, Mattias, Egevad, Lars, Ruusuvuori, Pekka, Eklund, Martin, Kartasalo, Kimmo
The potential of artificial intelligence (AI) in digital pathology is limited by technical inconsistencies in the production of whole slide images (WSIs), which degrade AI performance and pose a challenge for widespread clinical application, as fine-tuning algorithms for each new site is impractical. Changes in the imaging workflow can also lead to compromised diagnoses and patient safety risks. We evaluated whether physical color calibration of scanners can standardize WSI appearance and enable robust AI performance. We employed a color calibration slide in four different laboratories and evaluated its impact on the performance of an AI system for prostate cancer diagnosis on 1,161 WSIs. Color standardization resulted in consistently improved AI model calibration and significant improvements in Gleason grading performance. The study demonstrates that physical color calibration provides a potential solution to the variation introduced by different scanners, making AI-based cancer diagnostics more reliable and applicable in clinical settings.
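The study applies a physical calibration slide in the scanning workflow; the details of that procedure are beyond the abstract. Purely to illustrate the underlying idea, mapping each scanner's colors toward common reference values, the hypothetical sketch below fits a linear color-correction matrix from measured calibration-patch colors.

import numpy as np

def fit_color_correction(measured, reference):
    # measured, reference: (n_patches, 3) mean RGB values in [0, 1] for the
    # calibration patches as seen by a scanner and as defined by the target.
    # Least-squares 3x3 matrix M such that measured @ M ~= reference.
    M, *_ = np.linalg.lstsq(measured, reference, rcond=None)
    return M

def apply_color_correction(image, M):
    # image: (H, W, 3) RGB in [0, 1]; apply the per-pixel linear correction.
    corrected = image.reshape(-1, 3) @ M
    return np.clip(corrected, 0.0, 1.0).reshape(image.shape)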
Domain-specific transfer learning in the automated scoring of tumor-stroma ratio from histopathological images of colorectal cancer
Petäinen, Liisa, Väyrynen, Juha P., Ruusuvuori, Pekka, Pölönen, Ilkka, Äyrämö, Sami, Kuopio, Teijo
Tumor-stroma ratio (TSR) is a prognostic factor for many types of solid tumors. In this study, we propose a method for the automated estimation of TSR from histopathological images of colorectal cancer. The method is based on convolutional neural networks trained to classify colorectal cancer tissue in hematoxylin-eosin stained samples into three classes: stroma, tumor and other. The models were trained using a data set consisting of 1,343 whole slide images. Three different training setups were applied with a transfer learning approach using domain-specific data, i.e. an external colorectal cancer histopathological data set. The three most accurate models were chosen as classifiers, TSR values were predicted, and the results were compared to visual TSR estimates made by a pathologist. The results suggest that classification accuracy does not improve when domain-specific data are used to pre-train the convolutional neural network models for the task at hand. Classification accuracy for stroma, tumor and other reached 96.1% on an independent test set. Among the three classes, the best model achieved the highest accuracy (99.3%) for the tumor class. When TSR was predicted with the best model, the correlation between the predicted values and those estimated by an experienced pathologist was 0.57. Further research is needed to study the associations between computationally predicted TSR values and other clinicopathological factors of colorectal cancer, as well as the overall survival of the patients.
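How the TSR value is derived from patch-level predictions is not spelled out in the abstract. As a hypothetical sketch only, one straightforward approach is to count classified patches and take the stroma fraction within the tumor-bearing area; the class labels and definition below are assumptions.

from collections import Counter

def tumor_stroma_ratio(patch_labels):
    # patch_labels: one of "stroma", "tumor" or "other" per image patch.
    counts = Counter(patch_labels)
    stroma, tumor = counts["stroma"], counts["tumor"]
    if stroma + tumor == 0:
        raise ValueError("No stroma or tumor patches found.")
    return stroma / (stroma + tumor)

# Example: 40 stroma patches and 60 tumor patches give a TSR of 0.4.
print(tumor_stroma_ratio(["stroma"] * 40 + ["tumor"] * 60))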
ACROBAT -- a multi-stain breast cancer histological whole-slide-image data set from routine diagnostics for computational pathology
Weitz, Philippe, Valkonen, Masi, Solorzano, Leslie, Carr, Circe, Kartasalo, Kimmo, Boissin, Constance, Koivukoski, Sonja, Kuusela, Aino, Rasic, Dusan, Feng, Yanbo, Pouplier, Sandra Kristiane Sinius, Sharma, Abhinav, Eriksson, Kajsa Ledesma, Latonen, Leena, Laenkholm, Anne-Vibeke, Hartman, Johan, Ruusuvuori, Pekka, Rantalainen, Mattias
The analysis of formalin-fixed paraffin-embedded (FFPE) tissue sections stained with haematoxylin and eosin (H&E) or immunohistochemistry (IHC) is an essential part of the pathologic assessment of surgically resected breast cancer specimens. IHC staining has been broadly adopted into diagnostic guidelines and routine workflows for manually assessing the status and score of several established biomarkers, including ER, PGR, HER2 and KI67. However, this task can also be facilitated by computational pathology image analysis methods. Research in computational pathology has recently made numerous substantial advances, often based on publicly available whole slide image (WSI) data sets. However, the field is still considerably limited by the sparsity of public data sets. In particular, there are no large, high-quality, publicly available data sets with WSIs of matching IHC- and H&E-stained tissue sections. Here, we publish the currently largest publicly available data set of WSIs of tissue sections from surgical resection specimens from female primary breast cancer patients, with matched WSIs of corresponding H&E- and IHC-stained tissue, consisting of 4,212 WSIs from 1,153 patients. The primary purpose of the data set was to facilitate the ACROBAT WSI registration challenge, which aims at accurately aligning H&E and IHC images. For research in the area of image registration, automatic quantitative feedback on registration algorithm performance remains available through the ACROBAT challenge website, based on more than 37,000 manually annotated landmark pairs from 13 annotators. Beyond registration, this data set has the potential to enable many different avenues of computational pathology research, including stain-guided learning, virtual staining, unsupervised pre-training, artefact detection and stain-independent models.
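For orientation, the landmark-based feedback mentioned above typically amounts to measuring distances between annotated landmarks in the fixed image and the corresponding registered landmarks. The sketch below is a generic illustration of such an error computation, not the challenge's official evaluation code; the function name and units are assumptions.

import numpy as np

def landmark_registration_error(fixed_landmarks, registered_landmarks, microns_per_pixel=1.0):
    # Both arrays: (n_landmarks, 2) pixel coordinates; returns per-landmark error in microns.
    distances = np.linalg.norm(fixed_landmarks - registered_landmarks, axis=1)
    return distances * microns_per_pixel

# Example: median error over three landmark pairs.
errors = landmark_registration_error(np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]),
                                     np.array([[1.0, 0.0], [10.0, 2.0], [3.0, 10.0]]))
print(np.median(errors))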
Pathologist-Level Grading of Prostate Biopsies with Artificial Intelligence
Ström, Peter, Kartasalo, Kimmo, Olsson, Henrik, Solorzano, Leslie, Delahunt, Brett, Berney, Daniel M., Bostwick, David G., Evans, Andrew J., Grignon, David J., Humphrey, Peter A., Iczkowski, Kenneth A., Kench, James G., Kristiansen, Glen, van der Kwast, Theodorus H., Leite, Katia R. M., McKenney, Jesse K., Oxley, Jon, Pan, Chin-Chen, Samaratunga, Hemamali, Srigley, John R., Takahashi, Hiroyuki, Tsuzuki, Toyonori, Varma, Murali, Zhou, Ming, Lindberg, Johan, Bergström, Cecilia, Ruusuvuori, Pekka, Wählby, Carolina, Grönberg, Henrik, Rantalainen, Mattias, Egevad, Lars, Eklund, Martin
Background: An increasing volume of prostate biopsies and a worldwide shortage of uropathologists put a strain on pathology departments. Additionally, the high intra- and inter-observer variability in grading can result in over- and undertreatment of prostate cancer. Artificial intelligence (AI) methods may alleviate these problems by assisting pathologists to reduce workload and harmonize grading. Methods: We digitized 6,682 needle biopsies from 976 participants in the population-based STHLM3 diagnostic study to train deep neural networks for assessing prostate biopsies. The networks were evaluated by predicting the presence, extent, and Gleason grade of malignant tissue in an independent test set comprising 1,631 biopsies from 245 men. We additionally evaluated grading performance on 87 biopsies individually graded by 23 experienced urological pathologists from the International Society of Urological Pathology. We assessed discriminatory performance using receiver operating characteristic (ROC) curves and tumor extent predictions by correlating predicted cancer length in millimeters against measurements by the reporting pathologist. We quantified the concordance between grades assigned by the AI and the expert urological pathologists using Cohen's kappa. Results: The performance of the AI in detecting and grading cancer in prostate needle biopsy samples was comparable to that of international experts in prostate pathology. The AI achieved an area under the ROC curve of 0.997 for distinguishing between benign and malignant biopsy cores, and 0.999 for distinguishing between men with or without prostate cancer. The correlation between millimeters of cancer predicted by the AI and assigned by the reporting pathologist was 0.96. For assigning Gleason grades, the AI achieved an average pairwise kappa of 0.62. This was within the range of the corresponding values for the expert pathologists (0.60 to 0.73).
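The headline metrics reported here, area under the ROC curve and Cohen's kappa, are standard and readily reproduced. The short sketch below shows how they could be computed with scikit-learn on toy data, purely for illustration; it is unrelated to the study's actual data or evaluation code.

import numpy as np
from sklearn.metrics import roc_auc_score, cohen_kappa_score

rng = np.random.default_rng(0)

# Benign (0) vs malignant (1) biopsy cores with AI-predicted probabilities.
y_true = rng.integers(0, 2, size=200)
y_score = np.clip(y_true * 0.8 + rng.normal(0.1, 0.2, size=200), 0.0, 1.0)
print("AUC:", roc_auc_score(y_true, y_score))

# Gleason grade groups (1-5) assigned by the AI and by one pathologist.
ai_grades = rng.integers(1, 6, size=100)
pathologist_grades = np.where(rng.random(100) < 0.7, ai_grades, rng.integers(1, 6, size=100))
print("Cohen's kappa:", cohen_kappa_score(ai_grades, pathologist_grades))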